Re: New Table Access Methods for Multi and Single Inserts

2021-01-17 Thread Luc Vlaming

On 17-01-2021 00:04, Jeff Davis wrote:



If we agree on removing the heap_multi_insert_v2 API and embedding that logic
inside heap_insert_v2, then we can do this - pass the required information
and the functions ExecInsertIndexTuples and ExecARInsertTriggers as
callbacks so that, whether heap_insert_v2 chooses single or multi inserts,
it can call back these functions with the required information after the
flush. We can add the callback and required information into
TableInsertState. But I'm not quite sure how we would arrange
ExecInsertIndexTuples and ExecARInsertTriggers as such callbacks.
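
A minimal sketch of how such callbacks could hang off TableInsertState (an
assumption for illustration only; the typedef and field names below are
hypothetical, not the proposed patch's actual definitions):

#include "postgres.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

struct TableInsertState;

/* callback invoked for each flushed slot, e.g. wrapping
 * ExecInsertIndexTuples and ExecARInsertTriggers with the executor
 * state captured in callback_arg */
typedef void (*InsertFlushCallback) (struct TableInsertState *state,
                                     TupleTableSlot *slot,
                                     void *callback_arg);

typedef struct TableInsertState
{
    Relation    rel;            /* target relation */
    /* ... buffered slots and other single/multi insert bookkeeping ... */

    InsertFlushCallback flush_callback; /* called after each flush */
    void       *callback_arg;   /* e.g. ResultRelInfo/EState for the callback */
} TableInsertState;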


How should the API interact with INSERT INTO ... SELECT? Right now it
doesn't appear to be integrated at all, but that seems like a fairly
important path for bulk inserts.

Regards,
Jeff Davis




Hi,

You mean how it could work, given that the table modification API uses 
table_tuple_insert_speculative? Just wondering whether you think it 
generally cannot work, or whether you would like to see that path / more 
paths integrated into the patch.


Kind regards,
Luc




[PATCH] More docs on what to do and not do in extension code

2021-01-17 Thread Craig Ringer
Hi folks

The attached patch expands the xfunc docs and bgworker docs a little,
providing a starting point for developers to learn how to do some common
tasks the Postgres Way.

It mentions in brief these topics:

* longjmp() based exception handling with elog(ERROR), PG_CATCH() and
PG_RE_THROW() etc
* Latches, spinlocks, LWLocks, heavyweight locks, condition variables
* shm, DSM, DSA, shm_mq
* syscache, relcache, relation_open(), invalidations
* deferred signal handling, CHECK_FOR_INTERRUPTS()
* Resource cleanup hooks and callbacks like on_exit, before_shmem_exit, the
resowner callbacks, etc
* signal handling in bgworkers

All very superficial, but all things I really wish I'd known a little
about, or even that I needed to learn about, when I started working on
postgres.
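
As a concrete taste of the first topic in the list above, a minimal sketch
(not taken from the attached patch) of the elog()/longjmp() exception
pattern:

#include "postgres.h"

/* minimal sketch of PG_TRY()/PG_CATCH(); real code usually also manages
 * memory contexts and uses CopyErrorData()/FlushErrorState() as needed */
static void
do_something_that_may_error(void)
{
	PG_TRY();
	{
		elog(ERROR, "something went wrong");	/* longjmp()s to PG_CATCH() */
	}
	PG_CATCH();
	{
		/* release private resources here, then propagate the error */
		PG_RE_THROW();
	}
	PG_END_TRY();
}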

I'm not sure it's in quite the right place. I wonder if there should be a
separate part of xfunc.sgml that covers the slightly more advanced bits of
postgres backend and function coding like this, lists relevant README files
in the source tree, etc.

I avoided going into details like how resource owners work. I don't want
the docs to have to cover all that in detail; what I hope to do is start
providing people with clear references to the right place in the code,
READMEs, etc to look when they need to understand specific topics.
From 96ce89cb7e1a97d9d247fbacba73ade85a01ea38 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 18 Jan 2021 14:05:15 +0800
Subject: [PATCH v1 2/2] Expand docs on PostgreSQL extension coding

Expand the docs on PostgreSQL extension coding and background worker
development a little so that key topics like allocation, interrupt
handling, exit callbacks, transaction callbacks, PG_TRY()/PG_CATCH(),
resource owners, transaction and snapshot state, etc are at least
briefly mentioned with a few pointers to where to learn more.

Add some detail on background worker signal handling.
---
 doc/src/sgml/bgworker.sgml |  86 ++--
 doc/src/sgml/xfunc.sgml| 162 -
 2 files changed, 239 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 7fd673ab54..9216b8e0ea 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -1,6 +1,6 @@
 
 
-
+
  Background Worker Processes
 
  
@@ -29,6 +29,12 @@
   
  
 
+ 
+  All code that will be executed in PostgreSQL background workers is expected
+  to follow the basic rules set out for all PostgreSQL extension code in .
+ 
+
  
   Background workers can be initialized at the time that
   PostgreSQL is started by including the module name in
@@ -95,6 +101,11 @@ typedef struct BackgroundWorker
buffers, or any custom data structures which the worker itself may
wish to create and use.
   
+  
+   See  for information on how to
+   request extension shared memory allocations, LWLocks,
+   etc.
+  
  
 
 
@@ -212,9 +223,9 @@ typedef struct BackgroundWorker
Signals are initially blocked when control reaches the
background worker's main function, and must be unblocked by it; this is to
allow the process to customize its signal handlers, if necessary.
-   Signals can be unblocked in the new process by calling
-   BackgroundWorkerUnblockSignals and blocked by calling
-   BackgroundWorkerBlockSignals.
+   It is important that all background workers set up and unblock signal
+   handling before they enter their main loops. Signal handling in background
+   workers is discussed separately in .
   
 
   
@@ -296,13 +307,74 @@ typedef struct BackgroundWorker
   
 
   
-   The src/test/modules/worker_spi module
-   contains a working example,
-   which demonstrates some useful techniques.
+   Background workers that require inter-process communication and/or
+   synchronisation should use PostgreSQL's built-in IPC and concurrency
+   features as discussed in 
+   and .
+  
+
+  
+   If a background worker needs to sleep or wait for activity it should
+   always use PostgreSQL's
+   interrupt-aware APIs for the purpose. Do not usleep(),
+   system(), make blocking system calls, etc.
+  
+
+  
+   The src/test/modules/worker_spi and
+   src/test/modules/test_shm_mq contain working examples
+   that demonstrate some useful techniques.
   
 
   
The maximum number of registered background workers is limited by
.
   
+
+  
+   Signal Handling in Background Workers
+
+   
+Signals can be unblocked in the new process by calling
+BackgroundWorkerUnblockSignals and blocked by calling
+BackgroundWorkerBlockSignals.
+The default signal handlers installed for background workers do
+not interrupt queries or blocking calls into other postgres code
+so they are only suitable for simple background workers that frequently and
+predictably return control to their main loop. If your worker uses the
+default background worker signal handling it should call
+HandleMainLoopInterrupts() on each pass 
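
A minimal sketch (not part of the attached patch; the function name is a
placeholder) of the kind of interrupt-aware main loop the new text
recommends for a worker that uses the default background worker signal
handlers:

#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/interrupt.h"
#include "storage/latch.h"

void
my_worker_main_loop(void)
{
	for (;;)
	{
		/* honour reload/shutdown requests under the default handlers */
		HandleMainLoopInterrupts();

		/* ... perform one unit of work here ... */

		/* sleep interruptibly instead of calling usleep() */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 10 * 1000L,	/* wake at least every 10 seconds */
						 PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
	}
}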

[PATCH] Cross-reference comments on signal handling logic

2021-01-17 Thread Craig Ringer
Hi all

The attached comments-only patch expands the signal handling section in
miscadmin.h a bit so that it mentions ProcSignal, deferred signal handling
during blocking calls, etc. It adds cross-refs between major signal
handling routines and the miscadmin comment to help readers track the
various scattered but inter-related code.

I hope this helps some new developers in future.
From a16d0a0f8f502ba3631d20d51c7bb50cedea6d57 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 18 Jan 2021 12:21:18 +0800
Subject: [PATCH v1 1/2] Comments and cross-references for signal handling

To help new developers come to terms with the complexity of signal
handling in PostgreSQL's various types of backend, expand the
main comment on signal handling in miscadmin.h. Cover the ProcSignal
mechanism for SIGUSR1 multiplexing. Mention that blocking calls into 3rd
party code and uninterruptible syscalls should be avoided in backend
code because of Pg's deferred signal processing logic. Provide a set of
cross-references to key routines related to signal handling.

In various signal handling routines, cross-reference the high level
overview comment in miscadmin.c.

This should possibly become part of the developer documentation rather
than a comment in a header file, but since we already have the comment,
it seemed sensible to start by updating it and making it more
discoverable.
---
 src/backend/postmaster/interrupt.c | 21 
 src/backend/tcop/postgres.c| 26 -
 src/include/miscadmin.h| 84 --
 src/port/pgsleep.c |  2 +-
 4 files changed, 113 insertions(+), 20 deletions(-)

diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index dd9136a942..e5256179ec 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -10,6 +10,15 @@
  *	  src/backend/postmaster/interrupt.c
  *
  *-
+ *
+ * This file defines bare-bones interrupt handlers for secondary helper
+ * processes run by the postmaster - the walwriter and bgwriter, and
+ * potentially some background workers.
+ *
+ * These handlers are NOT used by normal user backends as they do not support
+ * interruption of normal execution to respond to a signal, query cancellation,
+ * etc. See miscadmin.h for details on interrupt handling used by normal
+ * postgres backends - CHECK_FOR_INTERRUPTS(), ProcessInterrupts(), die(), etc.
  */
 
 #include "postgres.h"
@@ -28,6 +37,9 @@ volatile sig_atomic_t ShutdownRequestPending = false;
 
 /*
  * Simple interrupt handler for main loops of background processes.
+ *
+ * See also CHECK_FOR_INTERRUPTS() and ProcessInterrupts() for the user-backend
+ * variant of this function.
  */
 void
 HandleMainLoopInterrupts(void)
@@ -51,6 +63,8 @@ HandleMainLoopInterrupts(void)
  * Normally, this handler would be used for SIGHUP. The idea is that code
  * which uses it would arrange to check the ConfigReloadPending flag at
  * convenient places inside main loops, or else call HandleMainLoopInterrupts.
+ *
+ * Most backends use this handler.
  */
 void
 SignalHandlerForConfigReload(SIGNAL_ARGS)
@@ -67,6 +81,9 @@ SignalHandlerForConfigReload(SIGNAL_ARGS)
  * Simple signal handler for exiting quickly as if due to a crash.
  *
  * Normally, this would be used for handling SIGQUIT.
+ *
+ * See also quickdie() and die() for the separate signal handling logic
+ * used by normal user backends.
  */
 void
 SignalHandlerForCrashExit(SIGNAL_ARGS)
@@ -99,6 +116,10 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
+ *
+ * See also die() for the extended version of this handler that's used in
+ * backends that may need to be interrupted while performing long-running
+ * actions.
  */
 void
 SignalHandlerForShutdownRequest(SIGNAL_ARGS)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8cf1359e33..907b524e0f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2852,8 +2852,15 @@ quickdie(SIGNAL_ARGS)
 }
 
 /*
- * Shutdown signal from postmaster: abort transaction and exit
- * at soonest convenient time
+ * Shutdown signal from postmaster: set flags to ask ProcessInterrupts() to
+ * abort the current transaction and exit at the soonest convenient time.
+ *
+ * This handler is used as the SIGTERM handler by most backend types that may
+ * run arbitrary backend code, user queries, etc. Some backends use different
+ * handlers and sometimes different flags.
+ *
+ * See "System interrupt and critical section handling" in miscadmin.h for a
+ * higher level explanation of signal handling.
  */
 void
 die(SIGNAL_ARGS)
@@ -2924,6 +2931,9 @@ FloatExceptionHandler(SIGNAL_ARGS)
  * handling following receipt of SIGUSR1. Designed to be similar to die()
  * and 

RE: ResourceOwner refactoring

2021-01-17 Thread kuroda.hay...@fujitsu.com
Dear Heikki,

I apologize for sending again.

> I will check another ResourceOwnerEnlarge() if I have a time.

I checked all ResourceOwnerEnlarge() types after applying patch 0001.
(searched by "grep -rI ResourceOwnerEnlarge")
No problem was found.

Hayato Kuroda
FUJITSU LIMITED



Re: Lazy JIT IR code generation to increase JIT speed with partitions

2021-01-17 Thread Luc Vlaming

Hi everyone, Andres,

On 03-01-2021 11:05, Luc Vlaming wrote:

On 30-12-2020 14:23, Luc Vlaming wrote:

On 30-12-2020 02:57, Andres Freund wrote:

Hi,

Great to see work in this area!


I would like this topic to progress somehow and was wondering what other 
benchmarks / tests would be needed to make that happen. I've so far 
provided benchmarks for small(ish) queries and some TPC-H numbers, 
assuming those would be enough.




On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
I would like to propose a small patch to the JIT machinery which makes the
IR code generation lazy. The reason for postponing the generation of the IR
code is that with partitions we get an explosion in the number of JIT
functions generated, as many child tables are involved, each with their own
JITted functions, especially when e.g. partition-aware joins/aggregates are
enabled. However, only a fraction of those functions is actually executed
because the Parallel Append node distributes the workers among the nodes.
With the attached patch we get lazy generation, which means this is no
longer a problem.


I unfortunately don't think this is quite good enough, because it'll
lead to emitting all functions separately, which can also lead to very
substantial increases of the required time (as emitting code is an
expensive step). Obviously that is only relevant in the cases where the
generated functions actually end up being used - which isn't the case in
your example.

If you e.g. look at a query like
   SELECT blub, count(*), sum(zap) FROM foo WHERE blarg = 3 GROUP BY blub;

on a table without indexes, you would end up with functions for

- WHERE clause (including deforming)
- projection (including deforming)
- grouping key
- aggregate transition
- aggregate result projection

with your patch each of these would be emitted separately, instead of
one go. Which IIRC increases the required time by a significant amount,
especially if inlining is done (where each separate code generation ends
up with copies of the inlined code).


As far as I can see you've basically falsified the second part of this
comment (which you moved):


+
+    /*
+     * Don't immediately emit nor actually generate the function.
+     * instead do so the first time the expression is actually evaluated.
+     * That allows to emit a lot of functions together, avoiding a lot of
+     * repeated llvm and memory remapping overhead. It also helps with not
+     * compiling functions that will never be evaluated, as can be the case
+     * if e.g. a parallel append node is distributing workers between its
+     * child nodes.
+     */



-    /*
-     * Don't immediately emit function, instead do so the first time the
-     * expression is actually evaluated. That allows to emit a lot of
-     * functions together, avoiding a lot of repeated llvm and memory
-     * remapping overhead.
-     */


Greetings,

Andres Freund



Hi,

Happy to help out, and thanks for the info and suggestions.
Also, I should have first searched pgsql-hackers and the like, as I just 
found out there are already discussions about this in [1] and [2].
However I think the approach I took can be taken independently and 
then other solutions could be added on top.


Assuming I understood all suggestions correctly, the ideas so far are:
1. Add an LLVMAddMergeFunctionsPass so that duplicate code is removed and
   not optimized several times (see [1]). Requires all code to be emitted
   in the same module.
2. JIT only parts of the plan, based on cost (see [2]).
3. Cache compilation results to avoid recompilation. This would either need
   a shm-capable optimized IR cache or would not work with parallel workers.
4. Lazy JITting (this patch)

An idea that might not have been presented on the mailing list yet(?):
5. Only JIT in nodes that process a certain number of rows, assuming there
   is a constant overhead for JITting and the goal is to gain runtime.

Going forward I would first try to see if my current approach can work 
out. The only idea that would be counterproductive to my solution 
would be solution 1. Afterwards I'd like to continue with either 
solution 2, 5, or 3 in the hopes that we can reduce JIT overhead to a 
minimum and can therefore apply it more broadly.


To test out why and where the JIT performance decreased with my 
solution I improved the test script and added various queries to model 
some of the cases I think we should care about. I have not (yet) done 
big scale benchmarks as these queries seemed to already show enough 
problems for now. Now there are 4 queries which test JITting 
with/without partitions, and with varying amounts of workers and 
rowcounts. I hope these are indeed a somewhat representative set of 
queries.


As pointed out the current patch does create a degradation in 
performance wrt queries that are not partitioned (basically q3 and 
q4). After looking into those queries I noticed two things:
- q3 is very noisy wrt JIT timings. This seems 

Narrow the scope of the variable outputstr in logicalrep_write_tuple

2021-01-17 Thread japin

Hi,

I find that the outputstr variable in logicalrep_write_tuple() is only used in
the `else` branch. I think we can narrow its scope, just like the variable
outputbytes in the `if` branch (for better readability).

/*
 * Send in binary if requested and type has suitable send function.
 */
if (binary && OidIsValid(typclass->typsend))
{
bytea  *outputbytes;
int len;

pq_sendbyte(out, LOGICALREP_COLUMN_BINARY);
outputbytes = OidSendFunctionCall(typclass->typsend, values[i]);
len = VARSIZE(outputbytes) - VARHDRSZ;
pq_sendint(out, len, 4);/* length */
pq_sendbytes(out, VARDATA(outputbytes), len);   /* data */
pfree(outputbytes);
}
else
{
pq_sendbyte(out, LOGICALREP_COLUMN_TEXT);
outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
pq_sendcountedtext(out, outputstr, strlen(outputstr), false);
pfree(outputstr);
}

Attached is a small patch to fix it.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 62275ebabe..f2c85cabb5 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -493,7 +493,6 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple, bool binar
HeapTuple   typtup;
Form_pg_type typclass;
Form_pg_attribute att = TupleDescAttr(desc, i);
-   char   *outputstr;

if (att->attisdropped || att->attgenerated)
continue;
@@ -537,6 +536,8 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple, bool binar
}
else
{
+   char   *outputstr;
+
pq_sendbyte(out, LOGICALREP_COLUMN_TEXT);
outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
pq_sendcountedtext(out, outputstr, strlen(outputstr), false);

Re: list of extended statistics on psql

2021-01-17 Thread Tatsuro Yamada

Hi Justin,

On 2021/01/18 1:52, Justin Pryzby wrote:

On Sun, Jan 17, 2021 at 03:31:57PM +0100, Tomas Vondra wrote:

I've reverted the commit - once we find the right way to handle this, I'll
get it committed again.


Please consider these doc changes for the next iteration.

commit 1a69f648ce6c63ebb37b6d8ec7c6539b3cb70787
Author: Justin Pryzby 
Date:   Sat Jan 16 17:47:35 2021 -0600

 doc review: psql \dX 891a1d0bca262ca78564e0fea1eaa5ae544ea5ee


Thanks for your comments!
It helps a lot since I'm not a native speaker.
I'll fix the document based on your suggestion on the next patch.

 

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index aaf55df921..a678a69dfb 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1928,15 +1928,15 @@ testdb=
  is specified, only those extended statistics whose names match the
  pattern are listed.
  If + is appended to the command name, each extended
-statistics is listed with its size.
+statistic is listed with its size.


Agreed.

   

  
-The column of the kind of extended stats (e.g. Ndistinct) shows some statuses.
+The column of the kind of extended stats (e.g. Ndistinct) shows its status.
  "requested" means that it needs to collect statistics by ANALYZE.
  "built" means ANALYZE was


Agreed.



-finished, and the planner can use it. NULL means that it doesn't exists.
+run, and statistics are available to the planner. NULL means that it doesn't exist.



Agreed.


Thanks,
Tatsuro Yamada





RE: Stronger safeguard for archive recovery not to miss data

2021-01-17 Thread osumi.takami...@fujitsu.com
Hi, Laurenz


On Friday, January 15, 2021 12:56 AM Laurenz Albe  
wrote:
> On Tue, 2020-12-08 at 03:08 +, osumi.takami...@fujitsu.com wrote:
> > On Thursday, November 26, 2020 4:29 PM Kyotaro Horiguchi
> >  wrote:
> > > At Thu, 26 Nov 2020 07:18:39 +, "osumi.takami...@fujitsu.com"
> > >  wrote in
> > > > The attached patch is intended to prevent a scenario that archive
> > > > recovery hits WALs which come from wal_level=minimal and the
> > > > server continues to work, which was discussed in the thread of [1].
> > >
> > > Perhaps we need the TAP test that conducts the above steps.
> >
> > I added the TAP tests to reproduce and share the result, using the
> > case of 6-(1) described above.
> > Here, I created a new file for it because the purposes of other files
> > in src/recovery didn't match the purpose of my TAP tests perfectly.
> > If you are dubious about this idea, please have a look at the comments
> > in each file.
> >
> > When the attached patch is applied,
> > my TAP tests are executed like other ones like below.
> >
> > t/018_wal_optimize.pl  ok t/019_replslot_limit.pl
> > .. ok t/020_archive_status.pl .. ok
> > t/021_row_visibility.pl .. ok t/022_archive_recovery.pl
> >  ok All tests successful.
> >
> > Also, I confirmed that there's no regression by make check-world.
> > Any comments ?
> 
> The patch applies and passes regression tests, as well as the new TAP test.
Thank you for checking.

> I think this should be backpatched, since it fixes a bug.
Agreed.

> I am not quite happy with the message:
> 
> FATAL:  WAL was generated with wal_level=minimal, data may be missing
> HINT:  This happens if you temporarily set wal_level=minimal without taking a
> new base backup.
> 
> This sounds too harmless to me and doesn't give the user a clue what would be
> the best way to proceed.
> 
> Suggestion:
> 
> FATAL:  WAL was generated with wal_level=minimal, cannot continue
> recovering
Adopted.

> DETAIL:  This happens if you temporarily set wal_level=minimal on the primary
> server.
> HINT:  Create a new standby from a new base backup after setting
> wal_level=replica.
Thanks for your suggestion.
I noticed that this message should cover both archive recovery modes, that
is, recovery mode and standby mode. I therefore combined your suggestion
above with this point of view. Have a look at the updated patch.
I also enriched the new TAP tests to show this perspective.
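
For illustration, a minimal sketch of the ereport() such a safeguard could
use, with the wording from the suggestion quoted above rather than the
final wording of the attached v03 patch:

/* sketch only: wording taken from the suggestion quoted above */
ereport(FATAL,
		(errmsg("WAL was generated with wal_level=minimal, cannot continue recovering"),
		 errdetail("This happens if you temporarily set wal_level=minimal on the primary server."),
		 errhint("Create a new standby from a new base backup after setting wal_level=replica.")));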

Best Regards,
Takamichi Osumi



stronger_safeguard_for_archive_recovery_v03.patch
Description: stronger_safeguard_for_archive_recovery_v03.patch


RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-17 Thread Tang, Haiying
> From: Amit Kapila 
> > Can we test cases when we have few rows in the Select table (say 
> > 1000) and there 500 or 1000 partitions. In that case, we won't 
> > select parallelism but we have to pay the price of checking 
> > parallel-safety of all partitions. Can you check this with 100, 200, 
> > 500, 1000 partitions table?
> 
> I also wanted to see such an extreme(?) case.  The 1,000 rows is not 
> the count per partition but the total count of all partitions.e.g., 
> when # of partitions is 100, # of rows per partition is 10.

Below are results for a serial plan where the SELECT table has 1,000 rows in
total. The Execution Time + Planning Time is still less than unpatched.
(Does this patch make some optimizations for serial insert? I'm a little
confused here, because the patched execution time is less than unpatched,
but I didn't find information in the commit messages about it. If I missed
something, please kindly let me know.)

partitions | patched Execution Time(ms) | patched Planning Time(ms) | master Execution Time(ms) | master Planning Time(ms) | %reg(Execution Time) | %reg(all time)
-----------|----------------------------|---------------------------|---------------------------|--------------------------|----------------------|---------------
100        | 5.294                      | 1.581                     | 6.951                     | 0.037                    | -24%                 | -2%
200        | 9.666                      | 3.068                     | 13.681                    | 0.043                    | -29%                 | -7%
500        | 22.742                     | 12.061                    | 35.928                    | 0.125                    | -37%                 | -3%
1000       | 46.386                     | 24.872                    | 75.523                    | 0.142                    | -39%                 | -6%

I did another test which makes the parallel-safety check overhead obvious.
This case does not fit the purpose of partitioning, but I put it here as an
extreme case too. The SELECT table has 1,000 rows in total and the number of
partitions is 2000, so only the first 1000 partitions have 1 row per
partition; the last 1000 partitions have no data inserted.

partitions | patched Execution Time(ms) | patched Planning Time(ms) | master Execution Time(ms) | master Planning Time(ms) | %reg(Execution Time) | %reg(all time)
-----------|----------------------------|---------------------------|---------------------------|--------------------------|----------------------|---------------
2000       | 45.758                     | 51.697                    | 80.272                    | 0.136                    | -43%                 | 21%

Regards,
Tang




Re: list of extended statistics on psql

2021-01-17 Thread Tatsuro Yamada

Hi Tomas and Shinoda-san,

On 2021/01/17 23:31, Tomas Vondra wrote:



On 1/17/21 3:01 AM, Tomas Vondra wrote:

On 1/17/21 2:41 AM, Shinoda, Noriyoshi (PN Japan FSIP) wrote:

Hi, hackers.

I tested this committed feature.
It doesn't seem to be available to non-superusers due to the inability to 
access pg_statistic_ext_data.
Is this the expected behavior?



Ugh. I overlooked testing the case where the user doesn't have the superuser
privilege. A user without the privilege was able to access pg_statistic_ext,
so I assumed it would also be able to access pg_statistic_ext_data. Oops.



Hmmm, that's a good point. Bummer we haven't noticed that earlier.

I wonder what the right fix should be - presumably we could do something like 
pg_stats_ext (we can't use that view directly, because it formats the data, so 
the sizes are different).

But should it list just the stats the user has access to, or should it list 
everything and leave the inaccessible fields NULL?



I've reverted the commit - once we find the right way to handle this, I'll get 
it committed again.

As for how to deal with this, I can think of about three ways:

1) simplify the command to only print information from pg_statistic_ext (so no 
information about which stats are built or their sizes)

2) extend pg_stats_ext with necessary information (e.g. sizes)

3) create a new system view, with necessary information (so that pg_stats_ext 
does not need to be modified)

4) add functions returning the necessary information, possibly only for 
statistics the user can access (similarly to what pg_stats_ext does)

Options 2-4 have the obvious disadvantage that this won't work on older 
releases (we can't add views or functions there). So I'm leaning towards #1 
even if that means we have to remove some of the details. We can consider 
adding that for new releases, though.



Thanks for the useful advice. I'll go with option 1).
The following query was created using pg_stats_ext instead of
pg_statistic_ext and pg_statistic_ext_data. However, I was confused about
how to write the part of the query that calculates the MCV size, because
there are four columns related to MCV: most_common_vals,
most_common_val_nulls, most_common_freqs, and most_common_base_freqs.
Currently, I don't know how to calculate the size of the MCV from these
four columns. Thoughts? :-)

===
\connect postgres hoge
create table hoge_t(a int, b int);
insert into hoge_t select i,i from generate_series(1,100) i;
create statistics hoge_t_ext on a, b from hoge_t;


SELECT
es.statistics_schemaname AS "Schema",
es.statistics_name AS "Name",
pg_catalog.format('%s FROM %s',
  (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(s.attname),', ')
   FROM pg_catalog.unnest(es.attnames) s(attname)),
es.tablename) AS "Definition",
CASE WHEN es.n_distinct IS NOT NULL THEN 'built'
 WHEN 'd' = any(es.kinds) THEN 'requested'
END AS "Ndistinct",
CASE WHEN es.dependencies IS NOT NULL THEN 'built'
 WHEN 'f' = any(es.kinds) THEN 'requested'
END AS "Dependencies",
CASE WHEN es.most_common_vals IS NOT NULL THEN 'built'
 WHEN 'm' = any(es.kinds) THEN 'requested'
END AS "MCV",
CASE WHEN es.n_distinct IS NOT NULL THEN
  pg_catalog.pg_size_pretty(pg_catalog.length(es.n_distinct)::bigint)
 WHEN 'd' = any(es.kinds) THEN '0 bytes'
END AS "Ndistinct_size",
CASE WHEN es.dependencies IS NOT NULL THEN
  pg_catalog.pg_size_pretty(pg_catalog.length(es.dependencies)::bigint)
 WHEN 'f' = any(es.kinds) THEN '0 bytes'
END AS "Dependencies_size"
FROM pg_catalog.pg_stats_ext es
ORDER BY 1, 2;

-[ RECORD 1 ]-+-
Schema| public
Name  | hoge_t_ext
Definition| a, b FROM hoge_t
Ndistinct | requested
Dependencies  | requested
MCV   | requested
Ndistinct_size| 0 bytes
Dependencies_size | 0 bytes

analyze hoge_t;

-[ RECORD 1 ]-+-
Schema| public
Name  | hoge_t_ext
Definition| a, b FROM hoge_t
Ndistinct | built
Dependencies  | built
MCV   | built
Ndistinct_size| 13 bytes
Dependencies_size | 40 bytes
===

Thanks,
Tatsuro Yamada







Re: list of extended statistics on psql

2021-01-17 Thread Tatsuro Yamada

Hi Tomas,

On 2021/01/17 8:32, Tomas Vondra wrote:

I've pushed this, keeping the "requested". If we decide that some other term is 
a better choice, we can tweak that later of course.

Thanks Tatsuro-san for the patience!


Thanks for taking the time to review the patches.
I believe this feature is useful for DBA when they use Extended stats.
For example, the execution plan tuning phase and so on.

I know the patch was reverted, so I'll keep working on fixing it for the
second iteration. :-D

Regards,
Tatsuro Yamada





Re: list of extended statistics on psql

2021-01-17 Thread Tatsuro Yamada

Hi Julien,

On 2021/01/15 17:47, Julien Rouhaud wrote:

Hello Yamada-san,

I reviewed the patch and don't have specific complaints, it all looks good!

I'm however thinking about the "requested" status.  I'm wondering if it could
lead to people think that an ANALYZE is scheduled and will happen soon.
Maybe "defined" or "declared" might be less misleading, or even "waiting for
analyze"?



Thanks for reviewing the patch.
Yeah, "waiting for analyze" was preferable but it was a little long to use on 
the columns. Unfortunately, I couldn't imagine a suitable term. Therefore,
I'm keeping it as is.

Regards,
Tatsuro Yamada





Re: Support for NSS as a libpq TLS backend

2021-01-17 Thread Michael Paquier
On Tue, Nov 17, 2020 at 04:00:53PM +0100, Daniel Gustafsson wrote:
> Nice, thanks for the fix!  I've incorporated your patch into the attached v20
> which also fixes client side error reporting to be more readable.  The SCRAM
> tests are now also hooked up, albeit with SKIP blocks for NSS, so they can
> start getting fixed.

On top of the set of TODO items mentioned in the logs of the patches,
this patch set needs a rebase because it does not apply.  In order to
move on with this set, I would suggest extracting some parts of the
patch set independently of the others and have two buildfarm members
for the MSVC and non-MSVC cases to stress the parts that can be
committed.  Just seeing the size, we could move on with:
- The ./configure set, with the change to introduce --with-ssl=openssl. 
- 0004 for strong randoms.
- Support for cryptohashes.

+/*
+ * BITS_PER_BYTE is also defined in the NSPR header files, so we need to undef
+ * our version to avoid compiler warnings on redefinition.
+ */
+#define pg_BITS_PER_BYTE BITS_PER_BYTE
+#undef BITS_PER_BYTE
This could be done separately.

src/sgml/libpq.sgml needs to document PQdefaultSSLKeyPassHook_nss, no?
--
Michael




Re: Wrong usage of RelationNeedsWAL

2021-01-17 Thread Noah Misch
On Sun, Jan 17, 2021 at 10:36:31PM -0800, Noah Misch wrote:
> On Mon, Jan 18, 2021 at 03:08:38PM +0900, Kyotaro Horiguchi wrote:
> > At Fri, 15 Jan 2021 20:38:16 -0800, Noah Misch  wrote in 
> > > On Wed, Jan 13, 2021 at 04:07:05PM +0900, Kyotaro Horiguchi wrote:
> > > > --- a/src/include/utils/snapmgr.h
> > > > +++ b/src/include/utils/snapmgr.h
> > > > @@ -37,7 +37,7 @@
> > > >   */
> > > >  #define RelationAllowsEarlyPruning(rel) \
> > > >  ( \
> > > > -RelationNeedsWAL(rel) \
> > > > +RelationIsWalLogged(rel) \
> > > 
> > > I suspect this is user-visible for a scenario like:
> > > 
> > > CREATE TABLE t AS SELECT ...; DELETE FROM t;
> > > -- ... time passes, rows of t are now eligible for early pruning ...
> > > BEGIN;
> > > ALTER TABLE t SET TABLESPACE something;  -- start skipping WAL
> > > SELECT count(*) FROM t;
> > > 
> > > After this patch, the SELECT would do early pruning, as it does in the 
> > > absence
> > > of the ALTER TABLE.  When pruning doesn't update the page LSN,
> > > TestForOldSnapshot() will be unable to detect that early pruning changed 
> > > the
> > > query results.  Hence, RelationAllowsEarlyPruning() must not change this 
> > > way.
> > > Does that sound right?
> > 
> > Mmm, maybe no. The pruning works on XID (or timestamp), not on LSN, so
> > it seems to work well even if pruning happened at the SELECT.
> > Conversely that should work after old_snapshot_threshold elapsed.
> > 
> > Am I missing something?
> 
> I wrote the above based on the "PageGetLSN(page) > (snapshot)->lsn" check in
> TestForOldSnapshot().  If the LSN isn't important, what else explains
> RelationAllowsEarlyPruning() checking RelationNeedsWAL()?

Thinking about it more, some RelationAllowsEarlyPruning() callers need
different treatment.  Above, I was writing about the case of deciding whether
to do early pruning.  The other RelationAllowsEarlyPruning() call sites are
deciding whether the relation might be lacking old data.  For that case, we
should check RELPERSISTENCE_PERMANENT, not RelationNeedsWAL().  Otherwise, we
could fail to report an old-snapshot error in a case like this:

setup: CREATE TABLE t AS SELECT ...;
xact1: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1;  -- take snapshot
xact2: DELETE FROM t;
(plenty of time passes)
xact3: SELECT count(*) FROM t;  -- early pruning
xact1: SAVEPOINT q; SELECT count(*) FROM t; ROLLBACK TO q;  -- "snapshot too 
old"
xact1: ALTER TABLE t SET TABLESPACE something;  -- start skipping WAL
xact1: SELECT count(*) FROM t;  -- no error, wanted "snapshot too old"

Is that plausible?




RE: POC: postgres_fdw insert batching

2021-01-17 Thread tsunakawa.ta...@fujitsu.com
Tomas-san,

From: Amit Langote 
> Good thing you reminded me that this is about inserts, and in that
> case no, ExecInitModifyTable() doesn't know all leaf partitions, it
> only sees the root table whose batch_size doesn't really matter.  So
> it's really ExecInitRoutingInfo() that I would recommend to set
> ri_BatchSize; right after this block:
> 
> /*
>  * If the partition is a foreign table, let the FDW init itself for
>  * routing tuples to the partition.
>  */
> if (partRelInfo->ri_FdwRoutine != NULL &&
> partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
> partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);
> 
> Note that ExecInitRoutingInfo() is called only once for a partition
> when it is initialized after being inserted into for the first time.
> 
> For a non-partitioned targets, I'd still say set ri_BatchSize in
> ExecInitModifyTable().

Attached is the patch that added call to GetModifyBatchSize() to the above two 
places.  The regression test passes.

(FWIW, frankly, I prefer the previous version because the code is a bit 
smaller...  Maybe we should refactor the code someday to reduce the similar 
processing in both the partitioned and non-partitioned cases.)
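
A rough sketch of the call added in ExecInitRoutingInfo(), right after the
BeginForeignInsert block quoted above (an assumption about the shape of the
change, not the attached v10 patch itself; the non-partitioned case in
ExecInitModifyTable() would look similar):

/* sketch only: ask the FDW how many rows it wants to insert per batch */
if (partRelInfo->ri_FdwRoutine != NULL &&
	partRelInfo->ri_FdwRoutine->GetModifyBatchSize != NULL)
	partRelInfo->ri_BatchSize =
		partRelInfo->ri_FdwRoutine->GetModifyBatchSize(partRelInfo);
else
	partRelInfo->ri_BatchSize = 1;	/* no batching */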


Regards
Takayuki Tsunakawa



v10-0001-Add-bulk-insert-for-foreign-tables.patch
Description: v10-0001-Add-bulk-insert-for-foreign-tables.patch


Re: Deleting older versions in unique indexes to avoid page splits

2021-01-17 Thread Amit Kapila
On Mon, Jan 18, 2021 at 12:43 AM Peter Geoghegan  wrote:
>
> On Sat, Jan 16, 2021 at 9:55 PM Amit Kapila  wrote:
> > Do we do this optimization (bottom-up deletion) only when the last
> > item which can lead to page split has indexUnchanged set to true? If
> > so, what if the last tuple doesn't have indexUnchanged but the
> > previous ones do have?
>
> Using the indexUnchanged hint as a triggering condition makes sure
> that non-HOT updaters are the ones that pay the cost of a bottom-up
> deletion pass. We create backpressure for non-HOT updaters (in indexes
> that are logically unchanged) specifically. (Of course it's also true
> that the presence of the indexUnchanged hint is highly suggestive of
> there being more version churn duplicates on the leaf page already,
> which is not actually certain.)
>
> It's possible that there will be some mixture of inserts and non-hot
> updates on the same leaf page, and as you say the implementation might
> fail to do a bottom-up pass based on the fact that an incoming item at
> the point of a would-be page split was a plain insert (and so didn't
> receive the hint).
>

or it would do the scan when that is the only time for this leaf page
to receive such a hint.

> The possibility of these kinds of "missed
> opportunities" are okay because a page split was inevitable either
> way. You can imagine a case where that isn't true (i.e. a missed
> opportunity to avoid a page split), but that's kind of like imagining
> a fair coin flip taking place 100 times and coming up heads each time.
> It just isn't realistic for such an "mixed tendencies" leaf page to
> stay in flux (i.e. not ever split) forever, with the smartest
> triggering condition in the world -- it's too unstable.
>

But even if we don't avoid it forever, delaying it will also have
advantages depending upon the workload. Let's try to see with an example:
say the item and page size are such that it would require 12 items to fill
the page completely. Case-1, where we decide based on
the hint received in the last item, and Case-2 where we decide based
on whether we ever received such a hint for the leaf page.

Case-1:

12 new items (6 inserts 6 non-HOT updates)
Page-1: 12 items - no split would be required because we received the
hint (indexUnchanged) with the last item, so page-1 will have 6 items
after clean up.

6 new items (3 inserts, 3 non-HOT updates)
Page-1: 12 items (9 inserts, 3 non-HOT updates) lead to split because
we received the last item without a hint.

The SPLIT-1 happens after 18 operations (6 after the initial 12).

After this split (SPLIT-1), we will have 2 pages with the below state:
Page-1: 6 items (4 inserts, 2 non-HOT updates)
Page-2: 6 items (5 inserts, 1 non-HOT updates)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items 1 insert and 2 non-HOT updates; Page-1: 9 items
(5 inserts, 4 non-HOT updates)
Page-2 got 3 new items 2 inserts and 1 non-HOT update; Page-2: 9 items
(7 inserts, 2 non-HOT updates)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items 1 insert and 2 non-HOT updates; Page-1: 12
items (6 inserts, 6 non-HOT updates) doesn't lead to split
Page-2 got 3 new items 2 inserts and 1 non-HOT update; Page-2: 9 items
(9 inserts, 3 non-HOT updates) lead to split (split-2)

The SPLIT-2 happens after 30 operations (12 new operations after the
previous split).

Case-2:

12 new items (6 inserts 6 non-HOT updates)
Page-1: 12 items - no split would be required because we received the
hint (indexUnchanged) with at least one of the item, so page-1 will
have 6 items after clean up.

6 new items (3 inserts, 3 non-HOT updates)
Page-1: 12 items (9 inserts, 3 non-HOT updates), cleanup happens and
Page-1 will have 9 items.

6 new items (3 inserts, 3 non-HOT updates), at this stage in whichever
order the new items are received one split can't be avoided.

The SPLIT-1 happens after 24 new operations (12 new ops after initial 12).
Page-1: 6 items (6 inserts)
Page-2: 6 items (6 inserts)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items 1 insert and 2 non-HOT updates; Page-1: 9 items
(7 inserts, 2 non-HOT updates)
Page-2 got 3 new items 2 inserts and 1 non-HOT update; Page-2: 9 items
(8 inserts, 1 non-HOT updates)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items 1 insert and 2 non-HOT updates; Page-1: 12
items (8 inserts, 4 non-HOT updates) clean up happens and page-1 will
have 8 items.
Page-2 got 3 new items 2 inserts and 1 non-HOT update; Page-2: 9 items
(10 inserts, 2 non-HOT updates) clean up happens and page-2 will have
10 items.

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert and 2 non-HOT updates; Page-1: 11
items (9 inserts, 2 non-HOT updates)
Page-2 got 3 new items, 2 inserts and 1 non-HOT update; Page-2: 12
items (12 inserts, 0 non-HOT updates) cleanup happens for one of the
non-HOT updates

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert, and 2 non-HOT 

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-17 Thread Bharath Rupireddy
On Mon, Jan 18, 2021 at 11:58 AM Fujii Masao
 wrote:
>
>
>
> On 2021/01/18 15:02, Hou, Zhijie wrote:
> >> We need to create the loopback3 with user mapping public, otherwise the
> >> test might become unstable as shown below. Note that loopback and
> >> loopback2 are not dropped in the test, so no problem with them.
> >>
> >>   ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'off');  DROP
> >> SERVER loopback3 CASCADE;
> >>   NOTICE:  drop cascades to 2 other objects
> >> -DETAIL:  drop cascades to user mapping for postgres on server loopback3
> >> +DETAIL:  drop cascades to user mapping for bharath on server loopback3
> >>
> >> Attaching v2 patch for postgres_fdw_get_connections. Please have a look.
> > Hi
> >
> > I have a comment for the doc about postgres_fdw_get_connections.
> >
> > +postgres_fdw_get_connections(OUT server_name text, OUT 
> > valid boolean) returns setof record
> > +
> > + 
> > +  This function returns the foreign server names of all the open
> > +  connections that postgres_fdw established from
> > +  the local session to the foreign servers. It also returns whether
> > +  each connection is valid or not. false is returned
> > +  if the foreign server connection is used in the current local
> > +  transaction but its foreign server or user mapping is changed or
> > +  dropped, and then such invalid connection will be closed at
> > +  the end of that transaction. true is returned
> > +  otherwise. If there are no open connections, no record is returned.
> > +  Example usage of the function:
> >
> > The doc does not seem to mention the case when the function returns NULL in server_name.
> > Users may be a little confused about why NULL was returned.
>
> Yes, so what about adding
>
>  (Note that the returned server name of invalid connection is NULL if its 
> server is dropped)
>
> into the following (just after "dropped")?
>
> +  if the foreign server connection is used in the current local
> +  transaction but its foreign server or user mapping is changed or
> +  dropped
>
> Or better description?

+1 to add it after "dropped (Note )", how about as follows
with slight changes?

dropped (Note that server name of an invalid connection can be NULL if
the server is dropped), and then such .

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Wrong usage of RelationNeedsWAL

2021-01-17 Thread Noah Misch
On Mon, Jan 18, 2021 at 03:08:38PM +0900, Kyotaro Horiguchi wrote:
> At Fri, 15 Jan 2021 20:38:16 -0800, Noah Misch  wrote in 
> > On Wed, Jan 13, 2021 at 04:07:05PM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/include/utils/snapmgr.h
> > > +++ b/src/include/utils/snapmgr.h
> > > @@ -37,7 +37,7 @@
> > >   */
> > >  #define RelationAllowsEarlyPruning(rel) \
> > >  ( \
> > > -  RelationNeedsWAL(rel) \
> > > +  RelationIsWalLogged(rel) \
> > 
> > I suspect this is user-visible for a scenario like:
> > 
> > CREATE TABLE t AS SELECT ...; DELETE FROM t;
> > -- ... time passes, rows of t are now eligible for early pruning ...
> > BEGIN;
> > ALTER TABLE t SET TABLESPACE something;  -- start skipping WAL
> > SELECT count(*) FROM t;
> > 
> > After this patch, the SELECT would do early pruning, as it does in the 
> > absence
> > of the ALTER TABLE.  When pruning doesn't update the page LSN,
> > TestForOldSnapshot() will be unable to detect that early pruning changed the
> > query results.  Hence, RelationAllowsEarlyPruning() must not change this 
> > way.
> > Does that sound right?
> 
> Mmm, maybe no. The pruning works on XID (or timestamp), not on LSN, so
> it seems to work well even if pruning happened at the SELECT.
> Conversely that should work after old_snapshot_threshold elapsed.
> 
> Am I missing something?

I wrote the above based on the "PageGetLSN(page) > (snapshot)->lsn" check in
TestForOldSnapshot().  If the LSN isn't important, what else explains
RelationAllowsEarlyPruning() checking RelationNeedsWAL()?




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-17 Thread Fujii Masao




On 2021/01/18 15:02, Hou, Zhijie wrote:

We need to create the loopback3 with user mapping public, otherwise the
test might become unstable as shown below. Note that loopback and
loopback2 are not dropped in the test, so no problem with them.

  ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'off');  DROP
SERVER loopback3 CASCADE;
  NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to user mapping for postgres on server loopback3
+DETAIL:  drop cascades to user mapping for bharath on server loopback3

Attaching v2 patch for postgres_fdw_get_connections. Please have a look.

Hi

I have a comment for the doc about postgres_fdw_get_connections.

+postgres_fdw_get_connections(OUT server_name text, OUT valid boolean) 
returns setof record
+
+ 
+  This function returns the foreign server names of all the open
+  connections that postgres_fdw established from
+  the local session to the foreign servers. It also returns whether
+  each connection is valid or not. false is returned
+  if the foreign server connection is used in the current local
+  transaction but its foreign server or user mapping is changed or
+  dropped, and then such invalid connection will be closed at
+  the end of that transaction. true is returned
+  otherwise. If there are no open connections, no record is returned.
+  Example usage of the function:

The doc does not seem to mention the case when the function returns NULL in 
server_name.
Users may be a little confused about why NULL was returned.


Yes, so what about adding

(Note that the returned server name of invalid connection is NULL if its 
server is dropped)

into the following (just after "dropped")?

+  if the foreign server connection is used in the current local
+  transaction but its foreign server or user mapping is changed or
+  dropped

Or better description?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: New IndexAM API controlling index vacuum strategies

2021-01-17 Thread Masahiko Sawada
On Mon, Jan 18, 2021 at 2:18 PM Masahiko Sawada  wrote:
>
> On Tue, Jan 5, 2021 at 10:35 AM Masahiko Sawada  wrote:
> >
> > On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan  wrote:
> > > >
> > > > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan  wrote:
> > > > > I experimented with this today, and I think that it is a good way to
> > > > > do it. I like the idea of choose_vacuum_strategy() understanding that
> > > > > heap pages that are subject to many non-HOT updates have a "natural
> > > > > extra capacity for LP_DEAD items" that it must care about directly (at
> > > > > least with non-default heap fill factor settings). My early testing
> > > > > shows that it will often take a surprisingly long time for the most
> > > > > heavily updated heap page to have more than about 100 LP_DEAD items.
> > > >
> > > > Attached is a rough patch showing what I did here. It was applied on
> > > > top of my bottom-up index deletion patch series and your
> > > > poc_vacuumstrategy.patch patch. This patch was written as a quick and
> > > > dirty way of simulating what I thought would work best for bottom-up
> > > > index deletion for one specific benchmark/test, which was
> > > > non-hot-update heavy. This consists of a variant pgbench with several
> > > > indexes on pgbench_accounts (almost the same as most other bottom-up
> > > > deletion benchmarks I've been running). Only one index is "logically
> > > > modified" by the updates, but of course we still physically modify all
> > > > indexes on every update. I set fill factor to 90 for this benchmark,
> > > > which is an important factor for how your VACUUM patch works during
> > > > the benchmark.
> > > >
> > > > This rough supplementary patch includes VACUUM logic that assumes (but
> > > > doesn't check) that the table has heap fill factor set to 90 -- see my
> > > > changes to choose_vacuum_strategy(). This benchmark is really about
> > > > stability over time more than performance (though performance is also
> > > > improved significantly). I wanted to keep both the table/heap and the
> > > > logically unmodified indexes (i.e. 3 out of 4 indexes on
> > > > pgbench_accounts) exactly the same size *forever*.
> > > >
> > > > Does this make sense?
> > >
> > > Thank you for sharing the patch. That makes sense.
> > >
> > > +if (!vacuum_heap)
> > > +{
> > > +if (maxdeadpage > 130 ||
> > > +/* Also check if maintenance_work_mem space is running 
> > > out */
> > > +vacrelstats->dead_tuples->num_tuples >
> > > +vacrelstats->dead_tuples->max_tuples / 2)
> > > +vacuum_heap = true;
> > > +}
> > >
> > > The second test checking if maintenance_work_mem space is running out
> > > also makes sense to me. Perhaps another idea would be to compare the
> > > number of collected garbage tuple to the total number of heap tuples
> > > so that we do lazy_vacuum_heap() only when we’re likely to reclaim a
> > > certain amount of garbage in the table.
> > >
> > > >
> > > > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
> > > > pg_stat_database shows an almost ~28% reduction in blks_read after an
> > > > overnight run for the patch series (it was 508,820,699 for the
> > > > patches, 705,282,975 for the master branch). I think that the VACUUM
> > > > component is responsible for some of that reduction. There were 11
> > > > VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
> > > > (these 7 VACUUM operations all only dead a btbulkdelete() call for the
> > > > one problematic index on the table, named "abalance_ruin", which my
> > > > supplementary patch has hard-coded knowledge of).
> > >
> > > That's a very good result in terms of skipping lazy_vacuum_heap(). How
> > > much the table and indexes bloated? Also, I'm curious about that which
> > > tests in choose_vacuum_strategy() turned vacuum_heap on: 130 test or
> > > test if maintenance_work_mem space is running out? And what was the
> > > impact on clearing all-visible bits?
> > >
> >
> > I merged these patches and polished it.
> >
> > In the 0002 patch, we calculate how many LP_DEAD items can be
> > accumulated in the space on a single heap page left by fillfactor. I
> > increased MaxHeapTuplesPerPage so that we can accumulate LP_DEAD items
> > on a heap page. Because otherwise accumulating LP_DEAD items
> > unnecessarily constrains the number of heap tuples in a single page,
> > especially when small tuples, as I mentioned before. Previously, we
> > constrained the number of line pointers to avoid excessive
> > line-pointer bloat and not require an increase in the size of the work
> > array. However, once amvacuumstrategy stuff entered the picture,
> > accumulating line pointers has value. Also, we might want to store the
> > returned value of amvacuumstrategy so that index AM can refer to it on
> > index-deletion.
> >
> > The 0003 

Re: Pg14, pg_dumpall and "password_encryption=true"

2021-01-17 Thread Michael Paquier
On Sun, Jan 17, 2021 at 02:20:10PM -0800, Noah Misch wrote:
> On Sun, Jan 17, 2021 at 01:51:35PM +0100, Magnus Hagander wrote:
>> Option 3 would be the closest to how other things work though,
>> wouldn't it? Normally, it's the job of pg_dump (or pg_dumpall) to
>> adapt the dump to the new version of PostgreSQL.
> 
> I didn't do a precedent search, but I can't think of an instance of those
> programs doing (3).  We have things like guessConstraintInheritance() that
> make up for lack of information in old versions, but dumpFunc() doesn't mutate
> any proconfig setting values.  Is there a particular pg_dump behavior you had
> in mind?

I don't recall any code paths in pg_dump touching array parsing that
enforces a value to something else if something is not supported.

>> In a lot of ways, I think (3) seems like the reasonable thing to do.
>> That's basically what we do when things change in the table creation
>> commands etc, so it seems like the logical place.
> 
> This would be interpreting setconfig='{password_encryption=on}' as "opt out of
> future password security increases".  I expect that will tend not to match the
> intent of the person entering the setting.  That said, if v14 were already
> behaving this way, I wouldn't dislike it enough to complain.

Hm.  Up to 13, "on" is a synonym of "md5", so it seems to me
that we should map "on" to "md5" rather than "scram-sha-256" on the
side of compatibility.  But you have a point when it comes to good
security practices.  It makes me wonder whether it would be better to
have pg_dumpall complain rather than pg_upgrade if this value is found
in the proconfig items..  pg_upgrade is not the only upgrade path
supported.
--
Michael




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-17 Thread Fujii Masao




On 2021/01/18 13:46, Bharath Rupireddy wrote:

On Mon, Jan 18, 2021 at 9:38 AM Fujii Masao  wrote:

Please review v11 further.


Thanks for updating the patch!

The patch for postgres_fdw_get_connections() basically looks good to me.
So at first I'd like to push it. Attached is the patch that I extracted
postgres_fdw_get_connections() part from 0001 patch and tweaked.
Thought?


Thanks.

We need to create the loopback3 with user mapping public, otherwise
the test might become unstable as shown below. Note that loopback and
loopback2 are not dropped in the test, so no problem with them.

  ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'off');
  DROP SERVER loopback3 CASCADE;
  NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to user mapping for postgres on server loopback3
+DETAIL:  drop cascades to user mapping for bharath on server loopback3

Attaching v2 patch for postgres_fdw_get_connections. Please have a look.


Thanks! You're right. I pushed the v2 patch.



I will post patches for the other function postgres_fdw_disconnect,
GUC and server level option later.


Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Wrong usage of RelationNeedsWAL

2021-01-17 Thread Kyotaro Horiguchi
Thank you for the comments, Noah and Andres.

At Fri, 15 Jan 2021 20:38:16 -0800, Noah Misch  wrote in 
> On Wed, Jan 13, 2021 at 04:07:05PM +0900, Kyotaro Horiguchi wrote:
> > The definition of the macro RelationNeedsWAL has been changed by
> > c6b92041d3 to include conditions related to the WAL-skip optimization
> > but some uses of the macro are not relevant to the optimization. That
> > misuses are harmless for now as they are only executed while wal_level
> > >= replica or WAL-skipping is inactive. However, this should be
> > corrected to prevent future hazard.
> 
> I see user-visible consequences:
> 
> > --- a/src/backend/catalog/pg_publication.c
> > +++ b/src/backend/catalog/pg_publication.c
> > @@ -67,7 +67,7 @@ check_publication_add_relation(Relation targetrel)
> >  errdetail("System tables cannot be added to 
> > publications.")));
> >  
> > /* UNLOGGED and TEMP relations cannot be part of publication. */
> > -   if (!RelationNeedsWAL(targetrel))
> > +   if (!RelationIsWalLogged(targetrel))
> 
> Today, the following fails needlessly under wal_level=minimal:
> 
> BEGIN;
> SET client_min_messages = 'ERROR';
> CREATE TABLE skip_wal ();
> CREATE PUBLICATION p FOR TABLE skip_wal;
> ROLLBACK;
> 
> Could you add that test to one of the regress scripts?

Mmm. I thought that it could not be used while wal_level=minimal. The
WARNING for insufficient wal_level shows that it is intended to
work. A test is added in the attached.

> > ereport(ERROR,
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> >  errmsg("table \"%s\" cannot be replicated",
> > diff --git a/src/backend/optimizer/util/plancat.c 
> > b/src/backend/optimizer/util/plancat.c
> > index da322b453e..0500efcdb9 100644
> > --- a/src/backend/optimizer/util/plancat.c
> > +++ b/src/backend/optimizer/util/plancat.c
> > @@ -126,7 +126,7 @@ get_relation_info(PlannerInfo *root, Oid 
> > relationObjectId, bool inhparent,
> > relation = table_open(relationObjectId, NoLock);
> >  
> > /* Temporary and unlogged relations are inaccessible during recovery. */
> > -   if (!RelationNeedsWAL(relation) && RecoveryInProgress())
> > +   if (!RelationIsWalLogged(relation) && RecoveryInProgress())
> 
> This has no user-visible consequences, but I agree you've clarified it.
> 
> > ereport(ERROR,
> > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >  errmsg("cannot access temporary or unlogged 
> > relations during recovery")));
> > diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
> > index 10b63982c0..810806a542 100644
> > --- a/src/include/utils/rel.h
> > +++ b/src/include/utils/rel.h
> > @@ -552,16 +552,23 @@ typedef struct ViewOptions
> > (relation)->rd_smgr->smgr_targblock = (targblock); \
> > } while (0)
> >  
> > +/*
> > + * RelationIsPermanent
> > + * True if relation is WAL-logged.
> > + */
> > +#define RelationIsWalLogged(relation)  
> > \
> > +   ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
> 
> To me, the names RelationIsWalLogged() and RelationNeedsWAL() don't convey
> their behavior difference.  How about one of the following spellings?

Yeah! I was concerned about that.

> - Name the new macro RelationEverNeedsWAL().
> - Just test "relpersistence == RELPERSISTENCE_PERMANENT", without a macro.

Yes. I wasn't confident about using the macro since, as you pointed out,
the macro name is very confusing.  The reason for using a macro was
consistency with RelationUsesLocalBuffers().

I'm not sure of the exact meaning of "ever" here (it doesn't seem to mean
"always"). On the other hand, there are already many places where the
bare test above appears, so I chose to do it without a macro.
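
For illustration, a sketch of what one such call site looks like with the bare
test instead of a macro (reusing the publication check from the hunk quoted
above; not the exact committed change):

/* UNLOGGED and TEMP relations cannot be part of publication. */
if (targetrel->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT)
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("table \"%s\" cannot be replicated",
                    RelationGetRelationName(targetrel))));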

> > +
> >  /*
> >   * RelationNeedsWAL
> > - * True if relation needs WAL.
> > + * True if relation needs WAL at the time.
> >   *
> >   * Returns false if wal_level = minimal and this relation is created or
> >   * truncated in the current transaction.  See "Skipping WAL for New
> >   * RelFileNode" in src/backend/access/transam/README.
> >   */
> >  #define RelationNeedsWAL(relation) 
> > \
> > -   ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&  
> > \
> > +   (RelationIsWalLogged(relation) &&   
> > \
> >  (XLogIsNeeded() || 
> > \
> >   (relation->rd_createSubid == InvalidSubTransactionId &&   
> > \
> >relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
> > @@ -619,7 +626,7 @@ typedef struct ViewOptions
> >   */
> >  #define RelationIsAccessibleInLogicalDecoding(relation) \
> > 

RE: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-17 Thread Hou, Zhijie
> We need to create the loopback3 with user mapping public, otherwise the
> test might become unstable as shown below. Note that loopback and
> loopback2 are not dropped in the test, so no problem with them.
> 
>  ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'off');  DROP
> SERVER loopback3 CASCADE;
>  NOTICE:  drop cascades to 2 other objects
> -DETAIL:  drop cascades to user mapping for postgres on server loopback3
> +DETAIL:  drop cascades to user mapping for bharath on server loopback3
> 
> Attaching v2 patch for postgres_fdw_get_connections. Please have a look.
Hi

I have a comment for the doc about postgres_fdw_get_connections.

+postgres_fdw_get_connections(OUT server_name text, OUT 
valid boolean) returns setof record
+
+ 
+  This function returns the foreign server names of all the open
+  connections that postgres_fdw established from
+  the local session to the foreign servers. It also returns whether
+  each connection is valid or not. false is returned
+  if the foreign server connection is used in the current local
+  transaction but its foreign server or user mapping is changed or
+  dropped, and then such invalid connection will be closed at
+  the end of that transaction. true is returned
+  otherwise. If there are no open connections, no record is returned.
+  Example usage of the function:

The doc does not seem to mention the case where the function returns NULL in
server_name.
Users may be a little confused about why NULL was returned.
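
For illustration, a hypothetical session showing that NULL case (the server
names here are invented):

SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
 server_name | valid
-------------+-------
 loopback1   | t
             | f
(2 rows)

The second row has a NULL server_name because the server behind that
still-open connection was dropped in the current local transaction.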

Best regards,
houzj




Re: Is Recovery actually paused?

2021-01-17 Thread Dilip Kumar
On Sun, Jan 17, 2021 at 1:48 PM Dilip Kumar  wrote:
>
> On Sat, Jan 16, 2021 at 8:59 AM Masahiko Sawada  wrote:
> >
> > On Wed, Jan 13, 2021 at 9:20 PM Dilip Kumar  wrote:
> > >
> > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar  wrote:
> > > >
> > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA  wrote:
> > > > >
> > > > > On Thu, 10 Dec 2020 11:25:23 +0530
> > > > > Dilip Kumar  wrote:
> > > > >
> > > > > > > > However, I wonder users don't expect pg_is_wal_replay_paused to 
> > > > > > > > wait.
> > > > > > > > Especially, if max_standby_streaming_delay is -1, this will be 
> > > > > > > > blocked forever,
> > > > > > > > although this setting may not be usual. In addition, some users 
> > > > > > > > may set
> > > > > > > > recovery_min_apply_delay for a large.  If such users call 
> > > > > > > > pg_is_wal_replay_paused,
> > > > > > > > it could wait for a long time.
> > > > > > > >
> > > > > > > > At least, I think we need some descriptions on document to 
> > > > > > > > explain
> > > > > > > > pg_is_wal_replay_paused could wait while a time.
> > > > > > >
> > > > > > > Ok
> > > > > >
> > > > > > Fixed this, added some comments in .sgml as well as in function 
> > > > > > header
> > > > >
> > > > > Thank you for fixing this.
> > > > >
> > > > > Also, is it better to fix the description of pg_wal_replay_pause from
> > > > > "Pauses recovery." to "Request to pause recovery." in according with
> > > > > pg_is_wal_replay_paused?
> > > >
> > > > Okay
> > > >
> > > > >
> > > > > > > > Also, how about adding a new boolean argument to 
> > > > > > > > pg_is_wal_replay_paused to
> > > > > > > > control whether this waits for recovery to get paused or not? 
> > > > > > > > By setting its
> > > > > > > > default value to true or false, users can use the old format 
> > > > > > > > for calling this
> > > > > > > > and the backward compatibility can be maintained.
> > > > > > >
> > > > > > > So basically, if the wait_recovery_pause flag is false then we 
> > > > > > > will
> > > > > > > immediately return true if the pause is requested?  I agree that 
> > > > > > > it is
> > > > > > > good to have an API to know whether the recovery pause is 
> > > > > > > requested or
> > > > > > > not but I am not sure is it good idea to make this API serve both 
> > > > > > > the
> > > > > > > purpose?  Anyone else have any thoughts on this?
> > > > > > >
> > > > >
> > > > > I think the current pg_is_wal_replay_paused() already has another 
> > > > > purpose;
> > > > > this waits recovery to actually get paused. If we want to limit this 
> > > > > API's
> > > > > purpose only to return the pause state, it seems better to fix this 
> > > > > to return
> > > > > the actual state at the cost of lacking the backward compatibility. 
> > > > > If we want
> > > > > to know whether pause is requested, we may add a new API like
> > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait 
> > > > > recovery to actually
> > > > > get paused, we may add an option to pg_wal_replay_pause() for this 
> > > > > purpose.
> > > > >
> > > > > However, this might be a bikeshedding. If anyone don't care that
> > > > > pg_is_wal_replay_paused() can make user wait for a long time, I don't 
> > > > > care either.
> > > >
> > > > I don't think that it will be blocked ever, because
> > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the
> > > > recovery process will not be stuck on waiting for the WAL.
> > > >
> > > > > > > > As another comment, while pg_is_wal_replay_paused is blocking, 
> > > > > > > > I can not cancel
> > > > > > > > the query. I think CHECK_FOR_INTERRUPTS() is necessary in the 
> > > > > > > > waiting loop.
> > > > >
> > > > > How about this fix? I think users may want to cancel 
> > > > > pg_is_wal_replay_paused() during
> > > > > this is blocking.
> > > >
> > > > Yeah, we can do this.  I will send the updated patch after putting
> > > > some more thought into these comments.  Thanks again for the feedback.
> > > >
> > >
> > > Please find the updated patch.
> >
> > I've looked at the patch. Here are review comments:
> >
> > +   /* Recovery pause state */
> > +   RecoveryPauseState  recoveryPause;
> >
> > Now that the value can have tri-state, how about renaming it to
> > recoveryPauseState?
>
> This makes sense to me.
>
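
For context, a sketch of the tri-state type under discussion (the name of the
"requested" value is an assumption; RECOVERY_IN_PROGRESS and RECOVERY_PAUSED
appear in the quoted code below):

typedef enum RecoveryPauseState
{
    RECOVERY_IN_PROGRESS,       /* pause not requested */
    RECOVERY_PAUSE_REQUESTED,   /* pause requested, but not yet paused */
    RECOVERY_PAUSED             /* recovery actually paused */
} RecoveryPauseState;
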
> > ---
> >  bool
> >  RecoveryIsPaused(void)
> > +{
> > +   boolrecoveryPause;
> > +
> > +   SpinLockAcquire(&XLogCtl->info_lck);
> > +   recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED) ?
> > true : false;
> > +   SpinLockRelease(&XLogCtl->info_lck);
> > +
> > +   return recoveryPause;
> > +}
> > +
> > +bool
> > +RecoveryPauseRequested(void)
> >  {
> > boolrecoveryPause;
> >
> > SpinLockAcquire(&XLogCtl->info_lck);
> > -   recoveryPause = XLogCtl->recoveryPause;
> > +   recoveryPause = (XLogCtl->recoveryPause !=
> > RECOVERY_IN_PROGRESS) ? true : false;
> > SpinLockRelease(&XLogCtl->info_lck);
> >
> > return 

Re: Transactions involving multiple postgres foreign servers, take 2

2021-01-17 Thread Masahiko Sawada
On Fri, Jan 15, 2021 at 7:45 AM Zhihong Yu  wrote:
>
> For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :
>
> +   entry->changing_xact_state = true;
> ...
> +   entry->changing_xact_state = abort_cleanup_failure;
>
> I don't see return statement in between the two assignments. I wonder why 
> entry->changing_xact_state is set to true, and later being assigned again.

Because postgresRollbackForeignTransaction() can get called again in
case where an error occurred during aborting and cleanup the
transaction. For example, if an error occurred when executing ABORT
TRANSACTION (pgfdw_get_cleanup_result() could emit an ERROR),
postgresRollbackForeignTransaction() will get called again while
entry->changing_xact_state is still true. Then the entry will be
caught by the following condition and cleaned up:

/*
 * If connection is before starting transaction or is already unsalvageable,
 * do only the cleanup and don't touch it further.
 */
if (entry->changing_xact_state)
{
pgfdw_cleanup_after_transaction(entry);
return;
}

>
> For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :
>
> bq. This commits introduces to new background processes: foreign
>
> commits introduces to new -> commit introduces two new

Fixed.

>
> +FdwXactExistsXid(TransactionId xid)
>
> Since Xid is the parameter to this method, I think the Xid suffix can be 
> dropped from the method name.

But there is already a function named FdwXactExists()?

bool
FdwXactExists(Oid dbid, Oid serverid, Oid userid)

As far as I read other code, we already have such functions that have
the same functionality but have different arguments. For instance,
SearchSysCacheExists() and SearchSysCacheExistsAttName(). So I think
we can leave it as it is, but would it be better to have something like
FdwXactCheckExistence() and FdwXactCheckExistenceByXid()?

>
> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
>
> Please correct year in the next patch set.

Fixed.

>
> +FdwXactLauncherRequestToLaunch(void)
>
> Since the launcher's job is to 'launch', I think the Launcher can be omitted 
> from the method name.

Agreed. How about FdwXactRequestToLaunchResolver()?

>
> +/* Report shared memory space needed by FdwXactRsoverShmemInit */
> +Size
> +FdwXactRslvShmemSize(void)
>
> Are both Rsover and Rslv referring to resolver ? It would be better to use 
> whole word which reduces confusion.
> Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or 
> FdwXactResolveShmemInit)

Agreed. I realized that these functions are the launcher's function,
not resolver's. So I'd change to FdwXactLauncherShmemSize() and
FdwXactLauncherShmemInit() respectively.

>
> +fdwxact_launch_resolver(Oid dbid)
>
> The above method is not in camel case. It would be better if method names are 
> consistent (in casing).

Fixed.

>
> +errmsg("out of foreign transaction resolver slots"),
> +errhint("You might need to increase 
> max_foreign_transaction_resolvers.")));
>
> It would be nice to include the value of max_foreign_xact_resolvers

I agree it would be nice but looking at other code we don't include
the value in this kind of messages.

>
> For fdwxact_resolver_onexit():
>
> +   LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
> +   fdwxact->locking_backend = InvalidBackendId;
> +   LWLockRelease(FdwXactLock);
>
> There is no call to method inside the for loop which may take time. I wonder 
> if the lock can be obtained prior to the for loop and released coming out of 
> the for loop.

Agreed.

>
> +FXRslvLoop(void)
>
> Please use Resolver instead of Rslv

Fixed.

>
> +   FdwXactResolveFdwXacts(held_fdwxacts, nheld);
>
> Fdw and Xact are repeated twice each in the method name. Probably the method 
> name can be made shorter.

Fixed.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




RE: libpq debug log

2021-01-17 Thread k.jami...@fujitsu.com
Hi Iwata-san,

In addition to Tsunakawa-san's comments,
The compiler also complains:
  fe-misc.c:678:20: error: ‘lenPos’ may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
conn->outMsgStart = lenPos;

There's no need for variable lenPos anymore since we have decided *not* to 
support pre protocol 3.0.
Accordingly, we also have to update the description of pqPutMsgStart.
So you can remove the lenPos variable and the condition where you have to check 
for protocol version.

diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 8bc9966..3de48be 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -644,14 +644,12 @@ pqCheckInBufferSpace(size_t bytes_needed, PGconn *conn)
  *
  * The state variable conn->outMsgStart points to the incomplete message's
  * length word: it is either outCount or outCount+1 depending on whether
- * there is a type byte.  If we are sending a message without length word
- * (pre protocol 3.0 only), then outMsgStart is -1.  The state variable
- * conn->outMsgEnd is the end of the data collected so far.
+ * there is a type byte.  The state variable conn->outMsgEnd is the end of
+ * the data collected so far.
  */
 int
 pqPutMsgStart(char msg_type, bool force_len, PGconn *conn)
 {
-   int lenPos;
int endPos;

/* allow room for message type byte */
@@ -661,9 +659,8 @@ pqPutMsgStart(char msg_type, bool force_len, PGconn *conn)
endPos = conn->outCount;

/* do we want a length word? */
-   if (force_len || PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
+   if (force_len)
{
-   lenPos = endPos;
/* allow room for message length */
endPos += 4;
}
@@ -675,7 +672,7 @@ pqPutMsgStart(char msg_type, bool force_len, PGconn *conn)
if (msg_type)
conn->outBuffer[conn->outCount] = msg_type;
/* set up the message pointers */
-   conn->outMsgStart = lenPos;
+   conn->outMsgStart = endPos;
conn->outMsgEnd = endPos;



At the same time, the one below lacks one more zero. (Only has 8)
There should be 9 as Matsumura-san mentioned.
+   0, 0, 0, 0, 0, 0, 0, 0, /* \x65 ... \0x6d */


The following can be removed in pqStoreFrontendMsg():
+ * In protocol v2, we immediately print each message as we receive 
it.
+ * (XXX why?)


Maybe the following description can be paraphrased:
+ * The message length is fixed after putting the last field, but 
message
+ * length should be print before printing any fields.So we must 
store the
+ * field data in memory.
to:
+ * The message length is fixed after putting the last field. But 
message
+ * length should be printed before printing any field, so we must 
store
+ * the field data in memory.


In pqStoreFrontendMsg, pqLogMessagenchar, pqLogMessageString,
pqLogMessageInt, pqLogMessageByte1, maybe it is unnecessary to use
+   if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
because you have already indicated in the PQtrace() to return
silently when protocol 2.0 is detected.
+   /* Protocol 2.0 is not supported. */
+   if (PG_PROTOCOL_MAJOR(conn->pversion) < 3)
+   return;


In pqLogMessageInt(),
+   /* We delayed to print message type for special message. */
can be paraphrased to:
  /* We delay printing of the following special message_type */


Regards,
Kirk Jamison





Re: New IndexAM API controlling index vacuum strategies

2021-01-17 Thread Masahiko Sawada
On Tue, Jan 5, 2021 at 10:35 AM Masahiko Sawada  wrote:
>
> On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada  wrote:
> >
> > On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan  wrote:
> > >
> > > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan  wrote:
> > > > I experimented with this today, and I think that it is a good way to
> > > > do it. I like the idea of choose_vacuum_strategy() understanding that
> > > > heap pages that are subject to many non-HOT updates have a "natural
> > > > extra capacity for LP_DEAD items" that it must care about directly (at
> > > > least with non-default heap fill factor settings). My early testing
> > > > shows that it will often take a surprisingly long time for the most
> > > > heavily updated heap page to have more than about 100 LP_DEAD items.
> > >
> > > Attached is a rough patch showing what I did here. It was applied on
> > > top of my bottom-up index deletion patch series and your
> > > poc_vacuumstrategy.patch patch. This patch was written as a quick and
> > > dirty way of simulating what I thought would work best for bottom-up
> > > index deletion for one specific benchmark/test, which was
> > > non-hot-update heavy. This consists of a variant pgbench with several
> > > indexes on pgbench_accounts (almost the same as most other bottom-up
> > > deletion benchmarks I've been running). Only one index is "logically
> > > modified" by the updates, but of course we still physically modify all
> > > indexes on every update. I set fill factor to 90 for this benchmark,
> > > which is an important factor for how your VACUUM patch works during
> > > the benchmark.
> > >
> > > This rough supplementary patch includes VACUUM logic that assumes (but
> > > doesn't check) that the table has heap fill factor set to 90 -- see my
> > > changes to choose_vacuum_strategy(). This benchmark is really about
> > > stability over time more than performance (though performance is also
> > > improved significantly). I wanted to keep both the table/heap and the
> > > logically unmodified indexes (i.e. 3 out of 4 indexes on
> > > pgbench_accounts) exactly the same size *forever*.
> > >
> > > Does this make sense?
> >
> > Thank you for sharing the patch. That makes sense.
> >
> > +if (!vacuum_heap)
> > +{
> > +if (maxdeadpage > 130 ||
> > +/* Also check if maintenance_work_mem space is running out 
> > */
> > +vacrelstats->dead_tuples->num_tuples >
> > +vacrelstats->dead_tuples->max_tuples / 2)
> > +vacuum_heap = true;
> > +}
> >
> > The second test checking if maintenance_work_mem space is running out
> > also makes sense to me. Perhaps another idea would be to compare the
> > number of collected garbage tuples to the total number of heap tuples
> > so that we do lazy_vacuum_heap() only when we’re likely to reclaim a
> > certain amount of garbage in the table.
> >
> > >
> > > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
> > > pg_stat_database shows an almost ~28% reduction in blks_read after an
> > > overnight run for the patch series (it was 508,820,699 for the
> > > patches, 705,282,975 for the master branch). I think that the VACUUM
> > > component is responsible for some of that reduction. There were 11
> > > VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
> > > (these 7 VACUUM operations all only did a btbulkdelete() call for the
> > > one problematic index on the table, named "abalance_ruin", which my
> > > supplementary patch has hard-coded knowledge of).
> >
> > That's a very good result in terms of skipping lazy_vacuum_heap(). How
> > much did the table and indexes bloat? Also, I'm curious about which
> > test in choose_vacuum_strategy() turned vacuum_heap on: the 130 test or
> > the test for whether maintenance_work_mem space is running out? And what was the
> > impact on clearing all-visible bits?
> >
>
> I merged these patches and polished it.
>
> In the 0002 patch, we calculate how many LP_DEAD items can be
> accumulated in the space on a single heap page left by fillfactor. I
> increased MaxHeapTuplesPerPage so that we can accumulate LP_DEAD items
> on a heap page. Because otherwise accumulating LP_DEAD items
> unnecessarily constrains the number of heap tuples in a single page,
> especially with small tuples, as I mentioned before. Previously, we
> constrained the number of line pointers to avoid excessive
> line-pointer bloat and not require an increase in the size of the work
> array. However, once amvacuumstrategy stuff entered the picture,
> accumulating line pointers has value. Also, we might want to store the
> returned value of amvacuumstrategy so that index AM can refer to it on
> index-deletion.
>
> The 0003 patch has btree indexes skip bulk-deletion if the index
> hasn't grown since the last bulk-deletion. I stored the number of blocks
> in the meta page but didn't implement meta page upgrading.
>

After more thought, I think that ambulkdelete needs 

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-17 Thread Michael Paquier
On Sat, Dec 12, 2020 at 01:45:26PM -0600, Justin Pryzby wrote:
> It's a lot like what I wrote as [PATCH v31 1/5] ExecReindex and ReindexParams
> In my v31 patch, I moved ReindexOptions to a private structure in indexcmds.c,
> with an "int options" bitmask which is passed to reindex_index() et al.  Your
> patch keeps/puts ReindexOptions index.h, so it also applies to reindex_index,
> which I think is good.

a3dc926 is an equivalent of 0001~0003 merged together.  0008 had
better be submitted into a separate thread if there is value to it.
0004~0007 are the pieces remaining.  Could it be possible to rebase
things on HEAD and put the tablespace bits into the structures 
{Vacuum,Reindex,Cluster}Params?
--
Michael


signature.asc
Description: PGP signature


Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-17 Thread Greg Nancarrow
On Fri, Jan 15, 2021 at 7:39 PM Amit Kapila  wrote:
>
> Here is an additional review of
> v11-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT. There are
> quite a few comments raised on the V11-0001* patch. I suggest first
> post a revised version of V11-0001* patch addressing those comments
> and then you can separately post a revised version of
> v11-0003-Enable-parallel-INSERT-and-or-SELECT-for-INSERT-INTO.
>

1)

>Here, it seems we are accessing the relation descriptor without any
>lock on the table which is dangerous considering the table can be
>modified in a parallel session. Is there a reason why you think this
>is safe? Did you check anywhere else such a coding pattern?

Yes, there's a very good reason and I certainly have checked for the
same coding pattern elsewhere, and not just randomly decided that
locking can be ignored.
The table has ALREADY been locked (by the caller) during the
parse/analyze phase.
(This is not the case for a partition, in which case the patch code
uses AccessShareLock, as you will see).
And BTW, with asserts enabled, an attempt to table_open() with NoLock
when you haven't already locked the table will fire an assert - see
following code in relation_open():

/*
 * If we didn't get the lock ourselves, assert that caller holds one,
 * except in bootstrap mode where no locks are used.
 */
Assert(lockmode != NoLock ||
       IsBootstrapProcessingMode() ||
       CheckRelationLockedByMe(r, AccessShareLock, true));

2)

>+ /*
>+ * If there are any index expressions, check that they are parallel-mode
>+ * safe.
>+ */
>+ max_hazard = index_expr_max_parallel_hazard_for_modify(rel, context);
>+ if (max_parallel_hazard_test(max_hazard, context))
>+ {
>+ table_close(rel, lockmode);
>+ return context->max_hazard;
>+ }

>Here and at all other similar places, the call to
>max_parallel_hazard_test seems redundant because
>index_expr_max_parallel_hazard_for_modify would have already done
>that. Why can't we just return true/false from
>index_expr_max_parallel_hazard_for_modify? The context would have been
>already updated for max_hazard.

Yes, you're right, it's redundant to call max_parallel_hazard_test(_)
again. The max_hazard can always be retrieved from the context if
needed (rather than explicitly returned), so I'll change this function
(and any similar cases) to just return true if the max_hazard of
interest has been reached.

3)

>I don't like this way of checking parallel_hazard for modify. This not
>only duplicates some code in max_parallel_hazard_for_modify from
>max_parallel_hazard but also appears quite awkward. Can we move
>max_parallel_hazard_for_modify inside max_parallel_hazard? Basically,
>after calling max_parallel_hazard_walker, we can check for modify
>statement and call the new function.

Agree, I'll move it, as you suggest.

4)

>domain_max_parallel_hazard_for_modify()
>{
>..
>+ if (isnull)
>+ {
>+ /*
>+ * This shouldn't ever happen, but if it does, log a WARNING
>+ * and return UNSAFE, rather than erroring out.
>+ */
>+ elog(WARNING, "null conbin for constraint %u", con->oid);
>+ context->max_hazard = PROPARALLEL_UNSAFE;
>+ break;
>+ }
>..
>}
>index_expr_max_parallel_hazard_for_modify()
>{
>..
>+ if (index_expr_item == NULL) /* shouldn't happen */
>+ {
>+ index_close(index_rel, lockmode);
>+ context->max_hazard = PROPARALLEL_UNSAFE;
>+ return context->max_hazard;
>+ }
>..
>}

>It is not clear why the above two are shouldn't happen cases and if so
>why we want to treat them as unsafe. Ideally, there should be an
>Assert if these can't happen but it is difficult to decide without
>knowing why you have considered them unsafe?

The checks being done here for "should never happen" cases are THE
SAME as other parts of the Postgres code.
For example, search Postgres code for "null conbin" and you'll find 6
other places in the Postgres code which actually ERROR out if conbin
(binary representation of the constraint) in a pg_constraint tuple is
found to be null.
The cases that you point out in the patch used to also error out in
the same way, but Vignesh suggested changing them to just return
parallel-unsafe instead of erroring out, which I agree with. Such
cases could surely only ever happen if the DB had been corrupted, so
they are extremely rare IMHO and most likely would have caused an ERROR
elsewhere before ever reaching here...
I can add some Asserts to the current code, to better alert for such
cases, for when asserts are enabled, but otherwise I think that the
current code is OK (cleaning up other Postgres code is beyond the
scope of the task here).


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-17 Thread Michael Paquier
On Thu, Jan 14, 2021 at 02:18:45PM +0900, Michael Paquier wrote:
> Indeed.  Let's first wait a couple of days and see if others have any
> comments or objections about this v6.

Hearing nothing, I have looked at that again this morning and applied
v6 after a reindent and some adjustments in the comments.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] [PATCH] Generic type subscripting

2021-01-17 Thread Pavel Stehule
On Thu, 14 Jan 2021 at 18:09, Dian M Fay  wrote:

> On Thu Jan 14, 2021 at 10:04 AM EST, Dmitry Dolgov wrote:
> > > On Tue, Jan 12, 2021 at 08:02:59PM +0100, Pavel Stehule wrote:
> > > ne 10. 1. 2021 v 19:52 odesílatel Pavel Stehule <
> pavel.steh...@gmail.com>
> > > napsal:
> > >
> > > I tested behaviour and I didn't find anything other than the mentioned
> > > issue.
> > >
> > > Now I can check this feature from plpgsql, and it is working. Because
> > > there is no special support in the plpgsql runtime, the update of jsonb is
> > > significantly slower than the update of arrays, and it looks like the
> > > update of jsonb has O(N^2) cost. I don't think that is important at this
> > > moment - more important is the fact that I didn't find any memory problems.
> >
> > Thanks for testing. Regarding updates when the structure doesn't match
> > provided path as I've mentioned I don't have strong preferences, but on
> > the second though probably more inclined for returning an error in this
> > case. Since there are pros and cons for both suggestions, it could be
> > decided by vote majority between no update (Dian) or an error (Pavel,
> > me) options. Any +1 to one of the options from others?
> >
> > Other than that, since I've already posted the patch for returning an
> > error option, it seems that the only thing left is to decide with which
> > version to go.
>
> The trigger issue (which I did verify) makes the "no update" option
> unworkable imo, JavaScript's behavior notwithstanding. But it should be
> called out very clearly in the documentation, since it does depart from
> what people more familiar with that behavior may expect. Here's a quick
> draft, based on your v44 patch:
>
> 
>  jsonb data type supports array-style subscripting expressions
>  to extract or update particular elements. It's possible to use multiple
>  subscripting expressions to extract nested values. In this case, a chain
> of
>  subscripting expressions follows the same rules as the
>  path argument in jsonb_set function,
>  e.g. in case of arrays it is a 0-based operation or that negative integers
>  that appear in path count from the end of JSON arrays.
>  The result of subscripting expressions is always of the jsonb data type.
> 
> 
>  UPDATE statements may use subscripting in the
>  SET clause to modify jsonb values. Every
>  affected value must conform to the path defined by the subscript(s). If
> the
>  path cannot be followed to its end for any individual value (e.g.
>  val['a']['b']['c'] where val['a'] or
>  val['b'] is null, a string, or a number), an error is
>  raised even if other values do conform.
> 
> 
>  An example of subscripting syntax:
>

+1

Pavel


Re: some pointless HeapTupleHeaderIndicatesMovedPartitions calls

2021-01-17 Thread Pavan Deolasee
Hi Alvaro,

On Tue, Sep 29, 2020 at 10:14 PM Alvaro Herrera 
wrote:

> Hello
>
> Pavan Deolasee recently noted that a few of the
> HeapTupleHeaderIndicatesMovedPartitions calls added by commit
> 5db6df0c0117 are useless, since they are done after comparing t_self
> with t_ctid.  That's because t_self can never be set to the magical
> values that indicate that the tuple moved partition.  If the first
> test fails (so we know t_self equals t_ctid), necessarily the second
> test will also fail.
>
> So these checks can be removed and no harm is done.
>
>
The patch looks good to me. The existing coding pattern was a bit confusing
and that's how I noticed it. So +1 for fixing it.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> We already allow users to specify the degree of parallelism for all
> the parallel operations via guc's max_parallel_maintenance_workers,
> max_parallel_workers_per_gather, then we have a reloption
> parallel_workers and vacuum command has the parallel option where
> users can specify the number of workers that can be used for
> parallelism. The parallelism considers these as hints but decides
> parallelism based on some other parameters like if there are that many
> workers available, etc. Why the users would expect differently for
> parallel DML?

I agree that the user would want to specify the degree of parallelism of DML, 
too.  My simple (probably silly) question was, in INSERT SELECT,

* If the target table has 10 partitions and the source table has 100 
partitions, how would the user want to specify parameters?

* If the source and target tables have the same number of partitions, and the 
user specified different values to parallel_workers and parallel_dml_workers, 
how many parallel workers would run?

* What would the query plan be like?  Something like below?  Can we easily 
support this sort of nested thing?

Gather
  Workers Planned: 
  Insert
Gather
  Workers Planned: 
  Parallel Seq Scan


> Which memory specific to partitions are you referring to here and does
> that apply to the patch being discussed?

Relation cache and catalog cache, which are not specific to partitions.  This 
patch's current parallel safety check opens and closes all descendant 
partitions of the target table.  That leaves those cache entries in 
CacheMemoryContext after the SQL statement ends.  But as I said, we can 
consider it not a serious problem in this case because the parallel DML would 
be executed in a limited number of concurrent sessions.  I just touched on the 
memory consumption issue for completeness in comparison with (3).


Regards
Takayuki Tsunakawa



Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2021-01-17 Thread Pavan Deolasee
On Mon, Jan 18, 2021 at 3:02 AM Tomas Vondra 
wrote:

>
>
> Pushed. Thanks everyone for the effort put into this patch. The first
> version was sent in 2015, so it took quite a bit of time.
>
>
Thanks Tomas, Anastasia and everyone else who worked on the patch and
ensured that it gets into the tree.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PoC/WIP: Extended statistics on expressions

2021-01-17 Thread Justin Pryzby
On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:
> > CREATE TABLE t AS SELECT generate_series(1,9) AS i;
> > CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
> > ANALYZE t;
> > SELECT i+1 FROM t GROUP BY 1;
> > ERROR:  corrupt MVNDistinct entry
> 
> Thanks. There was a thinko in estimate_multivariate_ndistinct, resulting in
> mismatching the ndistinct coefficient items. The attached patch fixes that,
> but I've realized the way we pick the "best" statistics may need some
> improvements (I added an XXX comment about that).

That maybe indicates a deficiency in testing and code coverage.

| postgres=# CREATE TABLE t(i int);
| postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
| postgres=# \d t
|  Table "public.t"
|  Column |  Type   | Collation | Nullable | Default 
| +-+---+--+-
|  i  | integer |   |  | 
| Indexes:
| "t_i_idx" btree (i)
| Statistics objects:
| "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t

on ... what ?

-- 
Justin




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-17 Thread Bharath Rupireddy
On Mon, Jan 18, 2021 at 9:38 AM Fujii Masao  wrote:
> > Please review v11 further.
>
> Thanks for updating the patch!
>
> The patch for postgres_fdw_get_connections() basically looks good to me.
> So I'd like to push it first. Attached is the patch where I extracted the
> postgres_fdw_get_connections() part from the 0001 patch and tweaked it.
> Thoughts?

Thanks.

We need to create the loopback3 with user mapping public, otherwise
the test might become unstable as shown below. Note that loopback and
loopback2 are not dropped in the test, so no problem with them.

 ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'off');
 DROP SERVER loopback3 CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to user mapping for postgres on server loopback3
+DETAIL:  drop cascades to user mapping for bharath on server loopback3
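
A minimal sketch of that idea (object names assumed, connection options
elided): with the mapping owned by PUBLIC, the cascade DETAIL no longer
depends on which role runs the regression test.

CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw;
CREATE USER MAPPING FOR public SERVER loopback3;
-- ... later in the test ...
DROP SERVER loopback3 CASCADE;
-- DETAIL now reads: drop cascades to user mapping for public on server loopback3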

Attaching v2 patch for postgres_fdw_get_connections. Please have a look.

I will post patches for the other function postgres_fdw_disconnect,
GUC and server level option later.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


postgres_fdw_get_connections_v2.patch
Description: Binary data


Re: Printing backtrace of postgres processes

2021-01-17 Thread Craig Ringer
On Mon, 18 Jan 2021 at 00:56, vignesh C  wrote:

>
> Thanks for your comments Andres, I will ignore it for the processes
> which do not have access to ProcSignal. I will make the changes and
> post a patch for this soon.
>

I think that's sensible.

I've had variable results with glibc's backtrace(), especially on older
platforms and/or with external debuginfo, but it's much better than
nothing. It's often not feasible to get someone to install gdb and run
commands on their production systems - they can be isolated and firewalled
or hobbled by painful change policies. Something built into
postgres, even if basic, is likely to come in very handy.
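
For reference, a minimal sketch of the glibc facility in question (not the
patch's actual code):

#include <execinfo.h>
#include <stdio.h>

static void
print_backtrace(void)
{
    void   *frames[32];
    int     nframes = backtrace(frames, 32);

    /* writes one symbolic frame per line to stderr */
    backtrace_symbols_fd(frames, nframes, fileno(stderr));
}

How useful the symbol names are depends on how the binary was built (dynamic
symbol table, -rdynamic, debuginfo availability), which matches the variable
results mentioned above.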


Re: fdatasync(2) on macOS

2021-01-17 Thread Tom Lane
Thomas Munro  writes:
> On Sat, Jan 16, 2021 at 6:55 AM Tom Lane  wrote:
>> I have a vague recollection that we discussed changing the default
>> wal_sync_method for Darwin years ago, but I don't recall why we
>> didn't pull the trigger.  These results certainly suggest that
>> we oughta.

> No strong preference here, at least without more information. It's
> unsettling that two of our wal_sync_methods are based on half-released
> phantom operating system features, but there doesn't seem to be much
> we can do about that other than try to understand what they do.  I see
> that the idea of defaulting to fsync_writethrough was discussed a
> decade ago and rejected[1].
> [1] 
> https://www.postgresql.org/message-id/flat/AANLkTik261QWc9kGv6acZz2h9ZrQy9rKQC8ow5U1tAaM%40mail.gmail.com

Ah, thanks for doing the archaeology on that.  Re-reading that old
thread, it seems like the two big arguments against making it
safe-by-default were

(1) other platforms weren't safe-by-default either.  Perhaps the
state of the art is better now, though?

(2) we don't want to force exceedingly-expensive defaults on people
who may be uninterested in reliable storage.  That seemed like a
shaky argument then and it still does now.  Still, I see the point
that suddenly degrading performance by orders of magnitude would
be a PR disaster.
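
For reference, anyone who wants the safe behavior today can opt in explicitly;
a sketch (wal_sync_method is reloadable, so no restart is required):

ALTER SYSTEM SET wal_sync_method = 'fsync_writethrough';
SELECT pg_reload_conf();

On macOS this routes WAL flushes through fcntl(F_FULLFSYNC), which is what
makes it both safe and slow.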

regards, tom lane




Re: Add docs stub for recovery.conf

2021-01-17 Thread Craig Ringer
On Thu, 14 Jan 2021 at 03:44, Stephen Frost  wrote:

>
> Alright, how does this look?  The new entries are all under the
> 'obsolete' section to keep it out of the main line, but should work to
> 'fix' the links that currently 404 and provide a bit of a 'softer'
> landing for the other cases that currently just forcibly redirect using
> the website doc alias capability.
>

Thanks for expanding the change to other high profile obsoleted or renamed
features and tools.

One minor point. I'm not sure this is quite the best way to spell the index
entries:

+   
+ obsolete
+ pg_receivexlog
+   

as it will produce an index term "obsolete" with a list of various
components under it. While that concentrates them nicely, it means people
won't actually find them if they're using the index alphabetically.

I'd slightly prefer


+   
+ pg_receivexlog
+ pg_receivewal
+   

even though that bulks the index up a little, because then people are a bit
more likely to find it.

Your extended and revised patch retains the above style for

+   
+ trigger_file
+ promote_trigger_file
+
...
+
+ standby_mode
+ standby.signal
+

so if you intend to change it, that entry needs changing too.


> I ended up not actually doing this for the catalog -> view change of
> pg_replication_slots simply because I don't really think folks will
> misunderstand or be confused by that redirect since it's still the same
> relation.  If others disagree though, we could certainly change that
> too.
>

I agree with you.


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-17 Thread Fujii Masao



On 2021/01/18 12:33, Bharath Rupireddy wrote:

On Sun, Jan 17, 2021 at 11:30 PM Zhihong Yu  wrote:

This patch introduces new function postgres_fdw_disconnect() when
called with a foreign server name discards the associated
connections with the server name.

I think the following would read better:

This patch introduces a new function postgres_fdw_disconnect(). When
called with a foreign server name, it discards the associated
connections with the server.


Thanks. I corrected the commit message.


Please note the removal of the 'name' at the end - connection is with server, 
not server name.

+   if (is_in_use)
+   ereport(WARNING,
+   (errmsg("cannot close the connection because it is still in 
use")));

It would be better to include servername in the message.


User would have provided the servername in
postgres_fdw_disconnect('myserver'), I don't think we need to emit the
warning again with the servername. The existing warning seems fine.


+   ereport(WARNING,
+   (errmsg("cannot close all connections because some of them 
are still in use")));

I think showing the number of active connections would be more informative.
This can be achieved by changing active_conn_exists from bool to int (named 
active_conns, e.g.):

+   if (entry->conn && !active_conn_exists)
+   active_conn_exists = true;

Instead of setting the bool value, active_conns can be incremented.


IMO, the number of active connections is not informative, because
users can not do anything with them. What's actually more informative
would be to list all the server names for which the connections are
active, instead of the warning - "cannot close all connections because
some of them are still in use". Having said that, I feel like it's an
overkill for now to do that. If required, we can enhance the warnings
in future. Thoughts?

Attaching v11 patch set, with changes only in 0001. The changes are
commit message correction and moved the warning related code to
disconnect_cached_connections from postgres_fdw_disconnect.

Please review v11 further.


Thanks for updating the patch!

The patch for postgres_fdw_get_connections() basically looks good to me.
So I'd like to push it first. Attached is the patch where I extracted the
postgres_fdw_get_connections() part from the 0001 patch and tweaked it.
Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index ee8a80a392..c1b0cad453 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -14,7 +14,7 @@ PG_CPPFLAGS = -I$(libpq_srcdir)
 SHLIB_LINK_INTERNAL = $(libpq)
 
 EXTENSION = postgres_fdw
-DATA = postgres_fdw--1.0.sql
+DATA = postgres_fdw--1.0.sql postgres_fdw--1.0--1.1.sql
 
 REGRESS = postgres_fdw
 
diff --git a/contrib/postgres_fdw/connection.c 
b/contrib/postgres_fdw/connection.c
index eaedfea9f2..a1404cb6bb 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -16,12 +16,14 @@
 #include "access/xact.h"
 #include "catalog/pg_user_mapping.h"
 #include "commands/defrem.h"
+#include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postgres_fdw.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
+#include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -74,6 +76,11 @@ static unsigned int prep_stmt_number = 0;
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
+/*
+ * SQL functions
+ */
+PG_FUNCTION_INFO_V1(postgres_fdw_get_connections);
+
 /* prototypes of private functions */
 static void make_new_connection(ConnCacheEntry *entry, UserMapping *user);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
@@ -1335,3 +1342,131 @@ exit:   ;
*result = last_res;
return timed_out;
 }
+
+/*
+ * List active foreign server connections.
+ *
+ * This function takes no input parameter and returns setof record made of
+ * following values:
+ * - server_name - server name of active connection. In case the foreign server
+ *   is dropped but still the connection is active, then the server name will
+ *   be NULL in output.
+ * - valid - true/false representing whether the connection is valid or not.
+ *  Note that the connections can get invalidated in pgfdw_inval_callback.
+ *
+ * No records are returned when there are no cached connections at all.
+ */
+Datum
+postgres_fdw_get_connections(PG_FUNCTION_ARGS)
+{
+#define POSTGRES_FDW_GET_CONNECTIONS_COLS  2
+   ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+   TupleDesc   tupdesc;
+   Tuplestorestate *tupstore;
+   MemoryContext per_query_ctx;
+   MemoryContext oldcontext;
+   HASH_SEQ_STATUS scan;
+   ConnCacheEntry *entry;

Re: Asynchronous Append on postgres_fdw nodes.

2021-01-17 Thread Etsuro Fujita
On Fri, Jan 15, 2021 at 4:54 PM Kyotaro Horiguchi
 wrote:
> Mmm. I meant that the function explicitly calls
> ExecAppendAsyncRequest(), which finally calls fetch_more_data_begin()
> (if needed). Conversely, if the function doesn't call
> ExecAppendAsyncRequest(), the next request to the remote doesn't
> happen. That is, after the tuple buffer of FDW-side is exhausted, the
> next request doesn't happen until executor requests for the next
> tuple. You seem to be saying that "postgresForeignAsyncRequest() calls
> fetch_more_data_begin() following its own decision."  but this doesn't
> seem to be "prefetching".

Let me explain a bit more.  Actually, the new version of the patch
allows prefetching in the FDW side; for such prefetching in
postgres_fdw, I think we could add a fetch_more_data_begin() call in
postgresForeignAsyncNotify().  But I left that for future work,
because we don’t know yet if that’s really useful.  (Another reason
why I left that is we have more important issues that should be
addressed [1], and I think addressing those issues is a requirement
for us to commit this patch, but adding such prefetching isn’t, IMO.)
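
Roughly, such prefetching might look like the sketch below (the callback and
field names here are assumptions based on this discussion, not code from the
patch):

static void
postgresForeignAsyncNotify(AsyncRequest *areq)
{
    ForeignScanState *node = (ForeignScanState *) areq->requestee;
    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;

    /* ... hand the tuples that just arrived to the core executor ... */

    /* prefetch: immediately start the next remote fetch */
    if (!fsstate->eof_reached)
        fetch_more_data_begin(node);
}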

> Sorry. I think I misread you here. I agree that, the notify API is not
> so useful now but would be useful if we allow notify descendents other
> than immediate children. However, I stumbled on the fact that some
> kinds of nodes don't return a result when all the underlying nodes have
> returned *a* tuple. Concretely, count(*) doesn't return until *all*
> tuples of the counted relation have been returned.  I remember that the
> fact might be the reason why I removed the API.  After all the topmost
> async-aware node must ask every immediate child if it can return a
> tuple.

The patch I posted, which revived Robert’s original patch using stuff
from your patch and Thomas’, provides ExecAsyncRequest() as well as
ExecAsyncNotify(), which supports pull-based execution like
ExecProcNode() (while ExecAsyncNotify() supports push-based
execution.)  In the aggregate case you mentioned, I think we could
iterate calling ExecAsyncRequest() for the underlying subplan to get
all tuples from it, in a similar way to ExecProcNode() in the normal
case.

> EPQ retrieves a specific tuple from a node. If we perform EPQ on an
> Append, only one of the children should offer a result tuple.  Since
> Append has no idea of which of its children will offer a result, it
> has no way other than asking all children until it receives a
> result. If we do that, asynchronously sending a query to all nodes
> would win.

Thanks for the explanation!  But I’m still not sure why we need to
send an asynchronous query to each of the asynchronous nodes in an EPQ
recheck.  Is it possible to explain a bit more about that?

I wrote:
> > That is what I'm thinking to be able to support the case I mentioned
> > above.  I think that that would allow us to find ready subplans
> > efficiently from occurred wait events in ExecAppendAsyncEventWait().
> > Consider a plan like this:
> >
> > Append
> > -> Nested Loop
> >   -> Foreign Scan on a
> >   -> Foreign Scan on b
> > -> ...
> >
> > I assume here that Foreign Scan on a, Foreign Scan on b, and Nested
> > Loop are all async-capable and that we have somewhere in the executor
> > an AsyncRequest with requestor="Nested Loop" and requestee="Foreign
> > Scan on a", an AsyncRequest with requestor="Nested Loop" and
> > requestee="Foreign Scan on b", and an AsyncRequest with
> > requestor="Append" and requestee="Nested Loop".  In
> > ExecAppendAsyncEventWait(), if a file descriptor for foreign table a
> > becomes ready, we would call ForeignAsyncNotify() for a, and if it
> > returns a tuple back to the requestor node (ie, Nested Loop) (using
> > ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for
> > Nested Loop*.  Nested Loop would then call ExecAsyncRequest() for the
> > inner requestee node (ie, Foreign Scan on b; I assume here that it is
> > a foreign scan parameterized by a).  If Foreign Scan on b returns a
> > tuple back to the requestor node (ie, Nested Loop) (using
> > ExecAsyncResponse()), then Nested Loop would match the tuples from the
> > outer and inner sides.  If they match, the join result would be
> > returned back to the requestor node (ie, Append) (using
> > ExecAsyncResponse()), marking the Nested Loop subplan as
> > as_needrequest.  Otherwise, Nested Loop would call ExecAsyncRequest()
> > for the inner requestee node for the next tuple, and so on.  If
> > ExecAsyncRequest() can't return a tuple immediately, we would wait
> > until a file descriptor for foreign table b becomes ready; we would
> > start from calling ForeignAsyncNotify() for b when the file descriptor
> > becomes ready.  In this way we could find ready subplans efficiently
> > from occurred wait events in ExecAppendAsyncEventWait() when extending
> > to the case where subplans are joins or aggregates over Foreign Scans,
> > I think.  Maybe I’m missing something, though.

> Maybe so. As I mentioned 

Re: fdatasync(2) on macOS

2021-01-17 Thread Thomas Munro
On Sat, Jan 16, 2021 at 6:55 AM Tom Lane  wrote:
> So it's not a no-op, but on the other hand it's not succeeding in getting
> bits down to the platter.  I'm not inclined to dike it out, but it does
> seem problematic that we're defaulting to open_datasync, which is also
> not getting bits down to the platter.

Hmm, OK, from these times it does appear that O_SYNC and O_DSYNC
actually do something then.  It's baffling that they are undocumented.
It might be possible to use dtrace on a SIP-disabled Mac to trace the
IOs with this script, to see if the B_FUA flag is being set, which
might make open_datasync better than fdatasync (if it's being sent and
not ignored), but again, who knows?!:

https://github.com/apple/darwin-xnu/blob/master/bsd/dev/dtrace/scripts/io.d

> I have a vague recollection that we discussed changing the default
> wal_sync_method for Darwin years ago, but I don't recall why we
> didn't pull the trigger.  These results certainly suggest that
> we oughta.

No strong preference here, at least without more information. It's
unsettling that two of our wal_sync_methods are based on half-released
phantom operating system features, but there doesn't seem to be much
we can do about that other than try to understand what they do.  I see
that the idea of defaulting to fsync_writethrough was discussed a
decade ago and rejected[1].  I'm not entirely sure how it manages to
be so slow.

It looks like the reliability section of our manual could use a spring
clean[2].  It's still talking about IDE and platters, instead of
modern stuff like NVMe, cloud/network storage and FUA flags.

[1] 
https://www.postgresql.org/message-id/flat/AANLkTik261QWc9kGv6acZz2h9ZrQy9rKQC8ow5U1tAaM%40mail.gmail.com
[2] https://www.postgresql.org/docs/13/wal-reliability.html




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-17 Thread Bharath Rupireddy
On Sun, Jan 17, 2021 at 11:30 PM Zhihong Yu  wrote:
> This patch introduces new function postgres_fdw_disconnect() when
> called with a foreign server name discards the associated
> connections with the server name.
>
> I think the following would read better:
>
> This patch introduces a new function postgres_fdw_disconnect(). When
> called with a foreign server name, it discards the associated
> connections with the server.

Thanks. I corrected the commit message.

> Please note the removal of the 'name' at the end - connection is with server, 
> not server name.
>
> +   if (is_in_use)
> +   ereport(WARNING,
> +   (errmsg("cannot close the connection because it is still 
> in use")));
>
> It would be better to include servername in the message.

User would have provided the servername in
postgres_fdw_disconnect('myserver'), I don't think we need to emit the
warning again with the servername. The existing warning seems fine.

> +   ereport(WARNING,
> +   (errmsg("cannot close all connections because some of 
> them are still in use")));
>
> I think showing the number of active connections would be more informative.
> This can be achieved by changing active_conn_exists from bool to int (named 
> active_conns, e.g.):
>
> +   if (entry->conn && !active_conn_exists)
> +   active_conn_exists = true;
>
> Instead of setting the bool value, active_conns can be incremented.

IMO, the number of active connections is not informative, because
users can not do anything with them. What's actually more informative
would be to list all the server names for which the connections are
active, instead of the warning - "cannot close all connections because
some of them are still in use". Having said that, I feel like it's an
overkill for now to do that. If required, we can enhance the warnings
in future. Thoughts?

Attaching v11 patch set, with changes only in 0001. The changes are
commit message correction and moved the warning related code to
disconnect_cached_connections from postgres_fdw_disconnect.

Please review v11 further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v11-0001-postgres_fdw-functions-to-show-and-discard-cache.patch
Description: Binary data


v11-0002-postgres_fdw-add-keep_connections-GUC-to-not-cac.patch
Description: Binary data


v11-0003-postgres_fdw-server-level-option-keep_connection.patch
Description: Binary data


Re: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread Amit Kapila
On Mon, Jan 18, 2021 at 6:08 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > I think it would be good if the parallelism works by default when
> > required but I guess if we want to use something on these lines then
> > we can always check if the parallel_workers option is non-zero for a
> > relation (with RelationGetParallelWorkers). So users can always say
> > Alter Table  Set (parallel_workers = 0) if they don't want
> > to enable write parallelism for tbl and if someone is bothered that
> > this might impact Selects as well because the same option is used to
> > compute the number of workers for it then we can invent a second
> > option parallel_dml_workers or something like that.
>
> Yes, if we have to require some specification to enable parallel DML, I agree
> that parallel query and parallel DML can be separately enabled.  With that
> said, I'm not sure whether the user, and PG developers, want to allow
> specifying the degree of parallelism for DML.
>

We already allow users to specify the degree of parallelism for all
the parallel operations via guc's max_parallel_maintenance_workers,
max_parallel_workers_per_gather, then we have a reloption
parallel_workers and vacuum command has the parallel option where
users can specify the number of workers that can be used for
parallelism. The parallelism considers these as hints but decides
parallelism based on some other parameters like if there are that many
workers available, etc. Why the users would expect differently for
parallel DML?
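
For example, under the idea above, a user who wants parallel scans but not
parallel DML on a particular table might do something like this
(parallel_dml_workers is the hypothetical second reloption mentioned above,
not an existing option):

ALTER TABLE tbl SET (parallel_workers = 4);        -- hint for parallel scans
ALTER TABLE tbl SET (parallel_dml_workers = 0);    -- hypothetical, disables parallel DML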

>
> > > As an aside, (1) and (2) has a potential problem with memory consumption.
> > >
> >
> > I can see the memory consumption argument for (2) because we might end
> > up generating parallel paths (partial paths) for reading the table but
> > don't see how it applies to (1)?
>
> I assumed that we would still open all partitions for parallel safety check 
> in (1) and (2).  In (1), parallel safety check is done only when parallel DML 
> is explicitly enabled by the user.  Just opening partitions keeps 
> CacheMemoryContext bloated even after they are closed.
>

Which memory specific to partitions are you referring to here and does
that apply to the patch being discussed?

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-17 Thread Greg Nancarrow
On Fri, Jan 8, 2021 at 8:25 PM vignesh C  wrote:
>

> For one of the test you can validate that the same transaction has
> been used by all the parallel workers, you could use something like
> below to validate:
> SELECT COUNT(*) FROM (SELECT DISTINCT cmin,xmin FROM  para_insert_p1) as dt;
>

> Few includes are not required:
>  #include "executor/nodeGather.h"
>
> This include is not required in nodeModifyTable.c
>

>
> The includes indexing.h, rewriteHandler.h & lmgr.h is not required in 
> clauses.c
>

> exeption should be exception
>
> resonable should be reasonable
>

Thanks Vignesh,
I'll be sure to make those updates in the next version of the patches.
Regards,
Greg




Re: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread Amit Kapila
On Sun, Jan 17, 2021 at 4:45 PM Amit Langote  wrote:
>
> On Sat, Jan 16, 2021 at 2:02 PM Amit Kapila  wrote:
> > On Fri, Jan 15, 2021 at 6:45 PM Amit Langote  
> > wrote:
> > > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila  
> > > wrote:
> > > > We want to do this for Inserts where only Select can be parallel and
> > > > Inserts will always be done by the leader backend. This is actually
> > > > the case we first want to implement.
> > >
> > > Sorry, I haven't looked at the linked threads and the latest patches
> > > there closely enough yet, so I may be misreading this, but if the
> > > inserts will always be done by the leader backend as you say, then why
> > > does the planner need to be checking the parallel safety of the
> > > *target* table's expressions?
> > >
> >
> > The reason is that once we enter parallel-mode we can't allow
> > parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
> > enter the parallel-mode at the beginning of the statement execution,
> > see ExecutePlan(). So, the Insert will be performed in parallel-mode
> > even though it happens in the leader backend. It is not possible that
> > we finish getting all the tuples from the gather node first and then
> > start inserting. Even, if we somehow find something to make this work
> > anyway the checks being discussed will be required to make inserts
> > parallel (where inserts will be performed by workers) which is
> > actually the next patch in the thread I mentioned in the previous
> > email.
> >
> > Does this answer your question?
>
> Yes, thanks for the explanation.  I kind of figured that doing the
> insert part itself in parallel using workers would be a part of the
> end goal of this work, although that didn't come across immediately.
>
> It's a bit unfortunate that the parallel safety check of the
> individual partitions cannot be deferred until it's known that a given
> partition will be affected by the command at all.  Will we need
> fundamental changes to how parallel query works to make that possible?
>  If so, have such options been considered in these projects?
>

I think it is quite fundamental to how parallel query works and we
might not be able to change it, for various reasons: (a) the optimizer
would end up generating a lot of paths even when it is not safe to do
so, and in the end we wouldn't use them; (b) if, after inserting into a
few partitions, we learn that the next partition we are about to insert
into has some parallel-unsafe constraints, then we would need to give
up the execution and restart the statement by planning it again with
non-parallel paths. Now, we could optimize this by retaining both
parallel and non-parallel plans, so that if we fail to execute the
parallel plan we can fall back to the non-parallel one, but that still
doesn't seem like an advisable approach.

The extra time spent in the optimizer should be paid off well by the
parallel execution. As pointed out earlier, you can see one of the
results shared on the other thread [1]. The cases that don't get the
benefit (say, when the underlying plan is non-parallel) may see some
impact, but we have not tested that in detail. The ideas we have
discussed so far to address that are (a) postpone parallel-safety
checks for partitions until there are some underlying partial paths
(from which parallel paths can be generated), though that has the
downside that we will end up generating partial paths even when they
are really not required, or (b) have a reloption like
parallel_dml_workers, or use the existing option parallel_workers, to
allow considering parallel insert for a relation. Any better ideas?

>  If such
> changes are not possible in the short term, like for v14, we should at
> least try to make sure that the eager checking of all partitions is
> only performed if using parallelism is possible at all.
>

As of now, we first check whether it is safe to generate a parallel plan
for the underlying SELECT (in INSERT INTO ... SELECT ...) and then perform
the parallel-safety checks for the DML. We can postpone the latter further,
as suggested above in (a).

> I will try to take a look at the patches themselves to see if there's
> something I know that will help.
>

Thank you. It will be really helpful if you can do that.

[1] - 
https://www.postgresql.org/message-id/b54f2e306780449093c38cd8a04e%40G08CNEXMBPEKD05.g08.fujitsu.local

-- 
With Regards,
Amit Kapila.




Re: adding wait_start column to pg_locks

2021-01-17 Thread torikoshia

On 2021-01-15 15:23, torikoshia wrote:

Thanks for your reviewing and comments!

On 2021-01-14 12:39, Ian Lawrence Barwick wrote:
Looking at the code, this happens as the wait start time is being recorded
in the lock record itself, so it always contains the value reported by the
latest lock acquisition attempt.


I think you are right and wait_start should not be recorded
in the LOCK.


On 2021-01-15 11:48, Ian Lawrence Barwick wrote:

Fri, Jan 15, 2021 at 3:45 Robert Haas :


On Wed, Jan 13, 2021 at 10:40 PM Ian Lawrence Barwick
 wrote:

It looks like the logical place to store the value is in the

PROCLOCK

structure; ...


That seems surprising, because there's one PROCLOCK for every
combination of a process and a lock. But, a process can't be waiting
for more than one lock at the same time, because once it starts
waiting to acquire the first one, it can't do anything else, and
thus
can't begin waiting for a second one. So I would have thought that
this would be recorded in the PROC.


Umm, I think we're at cross-purposes here. The suggestion is to note
the time when the process started waiting for the lock in the
process's
PROCLOCK, rather than in the lock itself (which in the original
version
of the patch resulted in all processes with an interest in the lock
appearing
to have been waiting to acquire it since the time a lock acquisition
was most recently attempted).


AFAIU, it seems possible to record wait_start in the PROCLOCK, but that
would be redundant since each process can wait for at most one lock.

To confirm my understanding, I'm going to make another patch that
records wait_start in the PGPROC.


Attached a patch.

I noticed the previous patches left wait_start untouched even after
the lock was acquired.
This patch also fixes that.

Any thoughts?
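For example, with the new column something like this should show how long
each blocked backend has been waiting (a rough sketch):

```sql
SELECT pid, locktype, mode, now() - wait_start AS waiting_for
FROM pg_locks
WHERE NOT granted;
```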


Regards,

--
Atsushi Torikoshi
From 62ff3e4dba7d45c260a62a33425cb2d1e6b822c9 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Mon, 18 Jan 2021 10:01:35 +0900
Subject: [PATCH v4] To examine the duration of locks, we have had to join
 pg_locks and pg_stat_activity and use columns such as query_start or
 state_change. However, since those only show when a query started or its
 state changed, we could not get the exact lock wait duration this way.

This patch adds a new field preserving the time at which locks started
waiting.
---
 contrib/amcheck/expected/check_btree.out |  4 ++--
 doc/src/sgml/catalogs.sgml   | 10 ++
 src/backend/storage/lmgr/lock.c  |  8 
 src/backend/storage/lmgr/proc.c  |  4 
 src/backend/utils/adt/lockfuncs.c|  9 -
 src/include/catalog/pg_proc.dat  |  6 +++---
 src/include/storage/lock.h   |  2 ++
 src/include/storage/proc.h   |  1 +
 src/test/regress/expected/rules.out  |  5 +++--
 9 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 13848b7449..c0aecb0288 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -97,8 +97,8 @@ SELECT bt_index_parent_check('bttest_b_idx');
 SELECT * FROM pg_locks
 WHERE relation = ANY(ARRAY['bttest_a', 'bttest_a_idx', 'bttest_b', 'bttest_b_idx']::regclass[])
 AND pid = pg_backend_pid();
- locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath 
---+--+--+--+---++---+-+---+--++-+--+-+--
+ locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | wait_start 
+--+--+--+--+---++---+-+---+--++-+--+-+--+
 (0 rows)
 
 COMMIT;
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 43d7a1ad90..a5ce0835a9 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -10589,6 +10589,16 @@ SCRAM-SHA-256$iteration count:
lock table
   
  
+
+ 
+  
+   wait_start timestamptz
+  
+  
+   Lock acquisition wait start time. NULL if
+   lock acquired.
+  
+ 
 

   
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 20e50247ea..5b5fb474e0 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3627,6 +3627,12 @@ GetLockStatusData(void)
 			instance->leaderPid = proc->pid;
 			instance->fastpath = true;
 
+			/*
+			 * Successfully taking fast path lock means there were no
+			 * conflicting locks.
+			 */
+			instance->wait_start = 0;
+
 			el++;
 		}
 
@@ -3654,6 +3660,7 @@ GetLockStatusData(void)
 			instance->pid = proc->pid;
 			instance->leaderPid = proc->pid;
 			

RE: libpq debug log

2021-01-17 Thread kuroda.hay...@fujitsu.com
Dear Tsunakawa-san, Iwata-san,

> * Is setlinebuf() available on Windows?  Maybe you should use setvbuf() 
> instead.

Yeah, cfbot2021 throws errors:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.124922

```
  src/interfaces/libpq/fe-connect.c(6776): warning C4013: 'setlinebuf' 
undefined; assuming extern returning int [C:\projects\postgresql\libpq.vcxproj]
```

The manpage of setlinebuf() suggests how to replace it, so you can follow that.

```
   The setbuffer() function is the same, except that the size of the buffer
   is up to the caller, rather than being determined by the default BUFSIZ.
   The setlinebuf() function is exactly equivalent to the call:

   setvbuf(stream, NULL, _IOLBF, 0);
```

Hayato Kuroda
FUJITSU LIMITED





Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2021-01-17 Thread Tatsuo Ishii
> Pushed. Thanks everyone for the effort put into this patch. The first
> version was sent in 2015, so it took quite a bit of time.

Great news. Thanks everyone who have been working on this.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-17 Thread tsunakawa.ta...@fujitsu.com
Hello Tang-san,

From: Amit Kapila 
> Can we test cases when we have few rows in the Select table (say 1000)
> and there 500 or 1000 partitions. In that case, we won't select
> parallelism but we have to pay the price of checking parallel-safety
> of all partitions. Can you check this with 100, 200, 500, 1000
> partitions table?

I also wanted to see such an extreme(?) case.  The 1,000 rows is not the count 
per partition but the total count across all partitions, e.g., when the number 
of partitions is 100, the number of rows per partition is 10.
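In other words, the setup I have in mind is roughly the following psql sketch
(table names are made up; the partition count is what we would vary):

```sql
CREATE TABLE src (a int, b int);
INSERT INTO src SELECT i, i FROM generate_series(1, 1000) i;  -- 1,000 rows total

CREATE TABLE tgt (a int, b int) PARTITION BY HASH (a);
-- create 100 partitions (vary: 100, 200, 500, 1000)
SELECT format('CREATE TABLE tgt_%s PARTITION OF tgt FOR VALUES WITH (MODULUS 100, REMAINDER %s);', i, i)
FROM generate_series(0, 99) i \gexec

EXPLAIN ANALYZE INSERT INTO tgt SELECT * FROM src;  -- statement under test
```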



Regards
Takayuki Tsunakawa



Re: Key management with tests

2021-01-17 Thread Robert Haas
On Fri, Jan 15, 2021 at 7:56 PM Andres Freund  wrote:
> I think that's not at all acceptable. I don't mind hashing out details
> on calls / off-list, but the design needs to be public, documented, and
> reviewable.  And if it's something the community can't understand, then
> it can't get in. We're going to have to maintain this going forward.

I agree. If the community is unable to clearly understand what
something is, and why we should have it, then we shouldn't have it --
even if the reason is that we're too dumb to understand, as Bruce
seems to be alleging. I don't really think I believe the theory that
community members by and large are too dumb to understand encryption.
Many features have provoked long and painful discussions about the
design and yet got into the tree in the end with documentation of that
design, and I don't see why that couldn't be done for this one, too. I
think it can and should, and the fact that the work hasn't been done
is one of several blockers for this patch. But even if I'm wrong, and
the real problem is that everyone except the select group of people on
these off-list phone calls are too stupid to understand this, then
that's still a reason not to accept the patch. The code that's in our
source tree is maintained by communal effort, and that means communal
understanding is important.

Frankly, it's more important in this particular case than in some
others. TDE is in great demand, so if it gets into the tree, it's
likely to get a lot of use. The preparatory patches, such as this one,
would at that point be getting a lot of use, too. That means many
people, not just hackers, will have to understand them and answer
questions about them. They are also likely to get a lot of scrutiny
from a security point of view, so we should have a way that we can be
confident that we know why we believe them to be secure. If a security
researcher shows up and says "your stuff is broken," we are not going
to get away with answering "no it isn't, because we discussed it on a
Friday call with a closed group of people and decided it was OK." Our
reasoning is going to have to be documented. That doesn't guarantee
that it will be correct, but it makes it possible to distinguish between
defects in design, defects in particular parts of the code, and
non-defects, which is otherwise impossible. Meanwhile, even if
security researchers are as happy with our TDE implementation as they
could possibly be, a feature that changes the on-disk format can't
erase our ability to solve other problems with the database. Databases
using TDE are still going to have corruption, for example, but now a
corrupted page has a good chance of being completely unreadable rather
than just garbled. You certainly aren't going to be able to just run
pg_filedump on it. I think even if we do a great job explaining to
everybody what impact TDE and its preparatory patches are likely to
have on the system, there's likely to be a lot of cases where users
blame the database for eating their data when the real culprit is the
OS or the hardware, just because such cases are bound to get harder to
investigate, which could have a very negative effect on the
perceptions of PostgreSQL's quality. But if the TDE itself is magic
that only designated smart people on special calls can understand,
then it's going to be far worse, because that'll mean when any kind of
TDE problems comes up, nobody else can help debug anything.

While I would like to have TDE in PostgreSQL, I would not like to have
it on those terms.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> I think it would be good if the parallelism works by default when
> required but I guess if we want to use something on these lines then
> we can always check if the parallel_workers option is non-zero for a
> relation (with RelationGetParallelWorkers). So users can always say
> Alter Table  Set (parallel_workers = 0) if they don't want
> to enable write parallelism for tbl and if someone is bothered that
> this might impact Selects as well because the same option is used to
> compute the number of workers for it then we can invent a second
> option parallel_dml_workers or something like that.

Yes, if we have to require some specification to enable parallel DML, I agree 
that parallel query and parallel DML can be enabled separately.  With that said, 
I'm not sure whether users, and PG developers, want to allow specifying the 
degree of parallelism for DML.


> > As an aside, (1) and (2) has a potential problem with memory consumption.
> >
> 
> I can see the memory consumption argument for (2) because we might end
> up generating parallel paths (partial paths) for reading the table but
> don't see how it applies to (1)?

I assumed that we would still open all partitions for parallel safety check in 
(1) and (2).  In (1), parallel safety check is done only when parallel DML is 
explicitly enabled by the user.  Just opening partitions keeps 
CacheMemoryContext bloated even after they are closed.


Regards
Takayuki Tsunakawa



Re: NOT VALID for Unique Indexes

2021-01-17 Thread David Fetter
On Thu, Jan 14, 2021 at 04:22:17PM +, Simon Riggs wrote:
> As you may be aware the NOT VALID qualifier currently only applies to
> CHECK and FK constraints, but not yet to unique indexes. I have had
> customer requests to change that.

This is a great feature!

Not exactly on point with this, but in a pretty closely related
context, is there some way we could give people the ability to declare,
at their own peril, that a constraint is valid without incurring the full
scan that VALIDATE currently does? This is currently doable by
fiddling directly with the catalog, an operation that is broadly more
dangerous and ill-advised.
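For the record, the kind of catalog fiddling I mean is roughly this
(emphatically at your own peril; the constraint and table names are made up):

```sql
-- mark a NOT VALID constraint as validated without the verifying scan
UPDATE pg_catalog.pg_constraint
   SET convalidated = true
 WHERE conname = 'my_fk'
   AND conrelid = 'my_table'::regclass;
```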

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Add primary keys to system catalogs

2021-01-17 Thread Robert Haas
On Sat, Oct 3, 2020 at 9:27 AM Craig Ringer
 wrote:
> Frankly I think we really need a way to mark FKs to be DISABLED or NOT 
> CHECKED or something and a way to mark them as NOT VALID, rather than 
> expecting users to fiddle with the implementation triggers. But I don't think 
> FKs on system catalogs require that; it'd just be cosmetic there.

Not really. I think the idea that users don't or shouldn't ever do
manual DDL on system catalogs is not very plausible, considering that
we suggest such steps in our own release notes.

I don't have any complaint about labelling some of the unique indexes
as primary keys, but I think installing foreign keys that don't really
enforce anything may lead to confusion.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PoC/WIP: Extended statistics on expressions

2021-01-17 Thread Tomas Vondra




On 1/17/21 6:29 AM, Justin Pryzby wrote:

On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:

diff --git a/doc/src/sgml/ref/create_statistics.sgml 
b/doc/src/sgml/ref/create_statistics.sgml
index 4363be50c3..5b8eb8d248 100644
--- a/doc/src/sgml/ref/create_statistics.sgml
+++ b/doc/src/sgml/ref/create_statistics.sgml
@@ -21,9 +21,13 @@ PostgreSQL documentation
  
   

  
+CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
+ON ( expression )
+FROM table_name



  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
  [ ( statistics_kind [, ... ] 
) ]
-ON column_name, column_name [, ...]
+ON { column_name | ( expression ) } [, ...]
  FROM table_name
  


I think this part is wrong, since it's not possible to specify a single column
in form#2.

If I'm right, the current way is:

  - form#1 allows expression statistics of a single expression, and doesn't
allow specifying "kinds";

  - form#2 allows specifying "kinds", but always computes expression statistics,
and requires multiple columns/expressions.

So it'd need to be column_name|expression, column_name|expression [,...]



Strictly speaking you're probably correct - there should be at least two 
elements. But I'm somewhat hesitant about making this more complex, 
because it'll be harder to understand.
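To make the two forms concrete, this is roughly what the patch accepts
(table and column names are made up, and the syntax may still change):

```sql
-- form #1: a single expression, no statistics kinds
CREATE STATISTICS s1 ON (lower(name)) FROM people;

-- form #2: kinds may be given, at least two columns and/or expressions
CREATE STATISTICS s2 (ndistinct, dependencies)
  ON city, (lower(name)) FROM people;
```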




@@ -39,6 +43,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] statistics_na
 database and will be owned by the user issuing the command.

  
+  

+   The CREATE STATISTICS command has two basic forms. The
+   simple variant allows building statistics for a single expression, does
+   not allow specifying any statistics kinds and provides benefits similar
+   to an expression index. The full variant allows defining statistics objects
+   on multiple columns and expressions, and selecting which statistics kinds 
will
+   be built. The per-expression statistics are built automatically when there
+   is at least one expression.
+  



+   
+expression
+
+ 
+  The expression to be covered by the computed statistics. In this case
+  only a single expression is required, in which case only the expression
+  statistics kind is allowed. The order of expressions is insignificant.


I think this part is wrong now ?
I guess there's no user-facing expression "kind" left in the CREATE command.
I guess "in which case" means "if only one expr is specified".
"expression" could be either form#1 or form#2.

Maybe it should just say:

+  An expression to be covered by the computed statistics.


Maybe somewhere else, say:

In the second form of the command, the order of expressions is insignificant.




Yeah, this is a leftover from when there was "expressions" kind. I'll 
reword this a bit.



thanks

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-01-17 Thread Tomas Vondra




On 1/17/21 3:55 AM, Zhihong Yu wrote:

Hi,
+    * Check that only the base rel is mentioned.  (This should be dead code
+    * now that add_missing_from is history.)
+    */
+   if (list_length(pstate->p_rtable) != 1)

If it is dead code, it can be removed, right?



Maybe. The question is whether it really is dead code. It's copied from 
transformIndexStmt so I kept it for now.



For statext_mcv_clauselist_selectivity:

+                   // bms_free(list_attnums[listidx]);
 > The commented line can be removed.



Actually, this should probably do list_free on the list_exprs.


+bool
+examine_clause_args2(List *args, Node **exprp, Const **cstp, bool 
*expronleftp)


It would be better to add a comment for examine_clause_args2(), since
examine_clause_args() already exists.




Yeah, this probably needs a better name. I have a feeling this may need 
a refactoring to reuse more of the existing code for the expressions.



+   if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)

When would stats->minrows have a negative value?



The typanalyze function (e.g. std_typanalyze) can set it to negative 
value. This is the same check as in examine_attribute, and we need the 
same protections I think.



For serialize_expr_stats():

+   sd = table_open(StatisticRelationId, RowExclusiveLock);
+
+   /* lookup OID of composite type for pg_statistic */
+   typOid = get_rel_type_id(StatisticRelationId);
+   if (!OidIsValid(typOid))
+       ereport(ERROR,

Looks like the table_open() call can be made after the typOid check.



Probably, but this failure is unlikely (should never happen) so I don't 
think this makes any real difference.



+       Datum       values[Natts_pg_statistic];
+       bool        nulls[Natts_pg_statistic];
+       HeapTuple   stup;
+
+       if (!stats->stats_valid)

It seems the local arrays can be declared after the validity check.



Nope, declarations after statements are not allowed by the project's coding
conventions (declarations go at the top of the block).


+           if (enabled[i] == STATS_EXT_NDISTINCT)
+               ndistinct_enabled = true;
+           if (enabled[i] == STATS_EXT_DEPENDENCIES)
+               dependencies_enabled = true;
+           if (enabled[i] == STATS_EXT_MCV)

The second and third if should be preceded by 'else'.



Yeah, although this just moves existing code. But you're right it could 
use else.



+ReleaseDummy(HeapTuple tuple)
+{
+   pfree(tuple);

Since ReleaseDummy() is just a single pfree call, maybe you don't need 
this function; just call pfree in its place.




No, that's not possible because the freefunc callback expects signature

void (*)(HeapTupleData *)

and pfree() does not match that.


thanks

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Yet another fast GiST build

2021-01-17 Thread Peter Geoghegan
On Sun, Jan 17, 2021 at 3:04 PM Peter Geoghegan  wrote:
> I personally agree with you - it's not like there aren't other ways
> for superusers to crash the server (most of which seem very similar to
> this gist_page_items() issue, in fact). I just think that it's worth
> being clear about that being a trade-off that we've accepted.

Can we rename gist_page_items_bytea() to gist_page_items(), and at the
same time rename the current gist_page_items() -- perhaps call it
gist_page_items_output()?

That way we could add a bt_page_items_output() function later, while
leaving everything consistent (actually not quite, since
bt_page_items() outputs text instead of bytea -- but that seems worth
fixing too). This also has the merit of making the unsafe "output"
variant into the special case.

-- 
Peter Geoghegan




Re: Yet another fast GiST build

2021-01-17 Thread Peter Geoghegan
On Sun, Jan 17, 2021 at 2:52 PM Heikki Linnakangas  wrote:
> I'm not sure I understand. It's true that the raw page image can contain
> data from a different index, or any garbage really. And the function
> will behave badly if you do that. That's an accepted risk with
> pageinspect functions, that's why they're superuser-only, although some
> of them are more tolerant of corrupt pages than others. The
> gist_page_items_bytea() variant doesn't try to parse the key data and is
> less likely to crash on bad input.

I personally agree with you - it's not like there aren't other ways
for superusers to crash the server (most of which seem very similar to
this gist_page_items() issue, in fact). I just think that it's worth
being clear about that being a trade-off that we've accepted.

-- 
Peter Geoghegan




Re: Yet another fast GiST build

2021-01-17 Thread Heikki Linnakangas

On 18/01/2021 00:35, Peter Geoghegan wrote:

On Sun, Jan 17, 2021 at 12:50 PM Tom Lane  wrote:

I noticed that gist_page_items() thinks it can hold inter_call_data->rel
open across a series of calls.  That's completely unsafe: the executor
might not run the call series to completion (see LIMIT), resulting in
relcache leak complaints.


Fixed, thanks! I changed it to return a tuplestore.


It also has the potential to run into big problems should the user
input a raw page image with a regclass-argument-incompatible tuple
descriptor. Maybe that's okay (this is a tool for experts), but it
certainly is a consideration.


I'm not sure I understand. It's true that the raw page image can contain 
data from a different index, or any garbage really. And the function 
will behave badly if you do that. That's an accepted risk with 
pageinspect functions, that's why they're superuser-only, although some 
of them are more tolerant of corrupt pages than others. The 
gist_page_items_bytea() variant doesn't try to parse the key data and is 
less likely to crash on bad input.


- Heikki




Re: WIP: System Versioned Temporal Table

2021-01-17 Thread Vik Fearing
On 1/17/21 5:46 PM, Surafel Temesgen wrote:
> On Sat, Jan 16, 2021 at 10:12 PM Vik Fearing 
> wrote:
> 
>>
>> I haven't looked at this patch in a while, but I hope that ALTER TABLE t
>> ADD SYSTEM VERSIONING is not adding any columns.  That is a bug if it does.
>>
>>
> Yes, that is how I implemented it. I don't understand why that would be a bug?

This is not good, and I see that DROP SYSTEM VERSIONING also removes
these columns which is even worse.  Please read the standard that you
are trying to implement!

I will do a more thorough review of the functionalities in this patch
(not necessarily the code) this week.
-- 
Vik Fearing




Re: Yet another fast GiST build

2021-01-17 Thread Peter Geoghegan
On Sun, Jan 17, 2021 at 12:50 PM Tom Lane  wrote:
> I noticed that gist_page_items() thinks it can hold inter_call_data->rel
> open across a series of calls.  That's completely unsafe: the executor
> might not run the call series to completion (see LIMIT), resulting in
> relcache leak complaints.

It also has the potential to run into big problems should the user
input a raw page image with a regclass-argument-incompatible tuple
descriptor. Maybe that's okay (this is a tool for experts), but it
certainly is a consideration.
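To spell the hazard out, the interface lets you pair a page image from one
index with a different index's regclass, e.g. (index names made up, and
assuming the just-committed two-argument form):

```sql
-- superuser-only pageinspect call; page from one index, descriptor from another
SELECT * FROM gist_page_items(get_raw_page('gist_idx_a', 0),
                              'gist_idx_b'::regclass);
```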

-- 
Peter Geoghegan




Re: pg_collation_actual_version() ERROR: cache lookup failed for collation 123

2021-01-17 Thread Thomas Munro
On Mon, Jan 18, 2021 at 10:59 AM Justin Pryzby wrote:
> |postgres=# SELECT pg_collation_actual_version(123);
> |ERROR:  cache lookup failed for collation 123

Yeah, not a great user experience.  Will fix next week; perhaps
get_collation_version_for_oid() needs missing_ok and found flags, or
something like that.

I'm also wondering if it would be better to name that thing with
"current" rather than "actual".




Re: Pg14, pg_dumpall and "password_encryption=true"

2021-01-17 Thread Noah Misch
On Sun, Jan 17, 2021 at 01:51:35PM +0100, Magnus Hagander wrote:
> On Sat, Jan 16, 2021 at 8:27 AM Noah Misch  wrote:
> > On Fri, Jan 15, 2021 at 01:35:50PM +0900, Ian Lawrence Barwick wrote:
> > > $ tail -3 pg_upgrade_utility.log
> > > ALTER ROLE "postgres" SET "password_encryption" TO 'true';
> > > psql:pg_upgrade_dump_globals.sql:75: ERROR:  invalid value for
> > > parameter "password_encryption": "true"
> > > HINT:  Available values: md5, scram-sha-256.
> > >
> > > This is a consquence of commit c7eab0e97, which removed support for the
> > > legacy
> > > values "on"/"true"/"yes"/"1".
> >
> > Right.  This can happen anytime we reduce the domain of a setting having
> > GucContext PGC_SU_BACKEND, PGC_BACKEND, PGC_SUSET, or PGC_USERSET.
> >
> > > I see following options to resolve this issue.
> > >
> > > 1. Re-add support for a "true" as an alias for "md5".
> > > 2. Transparently rewrite "true" to "md5"
> > > 3. Have pg_dumpall map "true" to "md5"
> > > 4. Add an option to pg_dumpall to specify the target version
> >
> > I expect rather few databases override this particular setting in
> > pg_proc.proconfig or pg_db_role_setting.setconfig.  The restore failure has 
> > a
> > clear error message, which is enough.  Each of the above would be overkill.
> 
> Option 3 would be the closest to how other things work though,
> wouldn't it? Normally, it's the job of pg_dump (or pg_dumpall) to
> adapt the dump to the new version of PostgreSQL.

I didn't do a precedent search, but I can't think of an instance of those
programs doing (3).  We have things like guessConstraintInheritance() that
make up for lack of information in old versions, but dumpFunc() doesn't mutate
any proconfig setting values.  Is there a particular pg_dump behavior you had
in mind?

> And pg_dump[all] always generates a dump that should reload in the
> same version of postgres as the dump program -- not the source
> postgresql. Thus doing (4) would mean changing that, and would have to
> teach pg_dump[all] about *all* version differences in all directions
> -- which would be a huge undertaking. Doing that for just one
> individual parameter would be very strange.

Agreed.

> In a lot of ways, I think (3) seems like the reasonable thing to do.
> That's basically what we do when things change in the table creation
> commands etc., so it seems like the logical place.

This would be interpreting setconfig='{password_encryption=on}' as "opt out of
future password security increases".  I expect that will tend not to match the
intent of the person entering the setting.  That said, if v14 were already
behaving this way, I wouldn't dislike it enough to complain.
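For anyone wanting to spot the affected settings before an upgrade, something
along these lines should work (an untested sketch):

```sql
-- per-role / per-database overrides still using the legacy boolean spelling
SELECT setdatabase, setrole, setconfig
FROM pg_catalog.pg_db_role_setting
WHERE EXISTS (SELECT 1 FROM unnest(setconfig) AS s
              WHERE s ~ '^password_encryption=(on|off|true|false|yes|no|0|1)$');
```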




Re: Add primary keys to system catalogs

2021-01-17 Thread Tom Lane
Peter Eisentraut  writes:
> [ v2-0001-Add-primary-keys-and-unique-constraints-to-system.patch ]

I've reviewed this patch.  It looks pretty solid to me, with a couple
trivial nits as mentioned below, and one bigger thing that's perhaps
in the category of bikeshedding.  Namely, do we really want to prefer
using the OID indexes as the primary keys?  In most cases there's some
other index that seems to me to be what a user would think of as the
pkey, for example pg_class_relname_nsp_index for pg_class or
pg_proc_proname_args_nsp_index for pg_proc.  Preferring OID where it
exists is a nice simple rule, which has some attractiveness, but the
OID indexes seem to me like a lookup aid rather than the "real" object
identity.
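To make the bikeshed concrete, the choice is between declarations along these
lines (spelling approximate, not lifted from the patch):

```sql
-- OID index as the primary key
ALTER TABLE pg_catalog.pg_class ADD PRIMARY KEY USING INDEX pg_class_oid_index;
-- versus the "natural" key
ALTER TABLE pg_catalog.pg_class ADD PRIMARY KEY USING INDEX pg_class_relname_nsp_index;
```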

> There is something strange going on in the misc_sanity test, as seen by 
> the test output change
> -NOTICE:  pg_constraint contains unpinned initdb-created object(s)

Formerly the lowest OID in pg_constraint was that of the
information_schema.cardinal_number_domain_check CHECK constraint,
which is intentionally not pinned.  Now the lowest OID is that of
pg_proc_oid_index, which is pinned, hence the output change.  I think
that's fine.  The idea here is just to notice whether any catalogs have
been missed by setup_depend, and we have now proven that pg_constraint
wasn't missed, so it's cool.

The other catalog that is listed in the expected output, but is not
one of the ones intentionally excluded by setup_depend, is pg_rewrite.
That's because its lowest OID is a rewrite rule for a system view, which
we've intentionally made non-pinned.  So that's also fine, but maybe
it'd be worth explaining it in the comment above the DO block.

Anyway, on to the minor nits:

system_constraints.sql should be removed by the maintainer-clean target
in src/backend/catalog/Makefile; perhaps also mention it in the comment
for the clean target.  Also I suppose src/tools/msvc/clean.bat needs to
remove it, like it does postgres.bki.

The contents of system_constraints.sql seem pretty randomly ordered,
and I bet the order isn't stable across machines.  It would be wise
to sort the entries by name to ensure we don't get inconsistencies
between different builds.  (If nothing else, that could complicate
validating tarballs.)

I don't follow why the pg_seclabel, pg_shseclabel indexes aren't made
into pkeys?  There's a comment claiming they "use a nondefault operator
class", but that's untrue AFAICS.

regards, tom lane




pg_collation_actual_version() ERROR: cache lookup failed for collation 123

2021-01-17 Thread Justin Pryzby
As of 257836a75, this returns:

|postgres=# SELECT pg_collation_actual_version(123);
|ERROR:  cache lookup failed for collation 123
|postgres=# \errverbose 
|ERROR:  XX000: cache lookup failed for collation 123
|LOCATION:  get_collation_version_for_oid, pg_locale.c:1754

I'm of the impression that's considered bad behavior for SQL-accessible
functions.

In v13, it did the same thing but with different language:

|ts=# SELECT pg_collation_actual_version(123);
|ERROR:  collation with OID 123 does not exist
|ts=# \errverbose 
|ERROR:  42704: collation with OID 123 does not exist
|LOCATION:  pg_collation_actual_version, collationcmds.c:367

-- 
Justin




Re: pgbench: option delaying queries till connections establishment?

2021-01-17 Thread Thomas Munro
On Sat, Jan 9, 2021 at 8:13 AM Thomas Munro  wrote:
> On Sun, Jan 3, 2021 at 9:49 AM Fabien COELHO  wrote:
> > > Just for fun, the attached 0002 patch is a quick prototype of a
> > > replacement for that stuff that seems to work OK on a Mac here.  (I'm
> > > not sure if the Windows part makes sense or works.)
> >
> > Thanks! That will definitely help because I do not have a Mac. I'll do
> > some cleanup.
>
> I think the main things to clean up are:

I’m supposed to be on vacation this week, but someone left a shiny new
Arm Mac laptop near me, so here’s a cleaned up version.

> 1.  pthread_barrier_init() should check for errors from
> pthread_cond_init() and pthread_mutex_init(), and return -1.

Done.

> 2.  pthread_barrier_destroy() should call pthread_cond_destroy() and
> pthread_mutex_destroy().

Done.

> 3 . Decide if it's sane for the Windows-based emulation to be in here
> too, or if it should stay in pgbench.c.  Or alternatively, if we're
> emulating pthread stuff on Windows, why not also put the other pthread
> emulation stuff from pgbench.c into a "ports" file; that seems
> premature and overkill for your project.  I dunno.

I decided to solve only the macOS problem for now.  So in this
version, the A and B patches are exactly as you had them in your v7,
except that B includes “port/pg_pthread.h” instead of <pthread.h>.

Maybe it’d make sense to move the Win32 pthread emulation stuff out of
pgbench.c into src/port too (the pre-existing stuff, and the new
barrier stuff you added), but that seems like a separate patch, one
that I’m not best placed to write, and it’s not clear to me that we’ll
want to be using pthread APIs as our main abstraction if/when thread
usage increases in the PG source tree anyway.  Other opinions welcome.

> 4.  cfbot shows that it's not building on Windows because
> HAVE_PTHREAD_BARRIER_WAIT is missing from Solution.pm.

Fixed, I think.


v8-0001-A.patch
Description: Binary data


v8-0002-Add-missing-pthread_barrier_t.patch
Description: Binary data


v8-0003-B.patch
Description: Binary data


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2021-01-17 Thread Tomas Vondra

On 1/16/21 11:18 PM, Tomas Vondra wrote:

...

>
Thanks for the updated patch, this version looks OK to me - I've marked 
it as RFC. I'll do a bit more testing, review, and then I'll get it 
committed.




Pushed. Thanks everyone for the effort put into this patch. The first 
version was sent in 2015, so it took quite a bit of time.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Yet another fast GiST build

2021-01-17 Thread Tom Lane
Heikki Linnakangas  writes:
> On 12/01/2021 18:19, Andrey Borodin wrote:
>> Thanks! Looks good to me.

> Pushed, thanks!

I noticed that gist_page_items() thinks it can hold inter_call_data->rel
open across a series of calls.  That's completely unsafe: the executor
might not run the call series to completion (see LIMIT), resulting in
relcache leak complaints.  I suspect that it might have cache-flush
hazards even without that.  I think this code needs to be rewritten to do
all the interesting work in the first call.  Or maybe better, return the
results as a tuplestore so you don't have to do multiple calls at all.

regards, tom lane




Re: evtfoid and evtowner missing in findoidjoins/README

2021-01-17 Thread Tom Lane
"Joel Jacobson"  writes:
> A bit of background:
> I'm working on an extension where I need SQL access to this reference
> information.  My extension is successfully automatically helping me to
> find problems in extension update-scripts, where an update from a
> version to a version would give a different result than directly
> installing the to-version from scratch.

> Currently, I'm parsing findoidjoins/README and importing the "from
> column" and "to column" to a lookup-table, which is used by my
> extension.

Hmm.  That README was certainly never intended to be used that way ;-)

> So I think adding this as a lookup table in pg_catalog is the best solution,
> since extension writers could then use this information in various ways.

I'm definitely -1 on adding a catalog for that.  But it seems like the
idea of not-really-enforced FK entries in pg_constraint would serve your
purposes just as well (and it'd be better from the standpoint of getting
the planner to be aware of these relationships).

> The information is theoretically already available via catalogs.sgml,
> but a) it's not easy to parse, and b) it's not available from SQL.

Well, SGML is actually plenty easy to parse as long as you've got xml
tooling at hand.  We'd never want to introduce such a dependency in the
normal build process, but making something like findoidjoins depend on
such tools seems within reason.  I recall having whipped up some one-off
Perl scripts of that sort when I was doing that massive rewrite of the
func.sgml tables last year.  I didn't save them though, and in any case
I'm the world's worst Perl programmer, so it'd be better for someone
with more Perl-fu to take point if we decide to go that route.

regards, tom lane




Re: evtfoid and evtowner missing in findoidjoins/README

2021-01-17 Thread Joel Jacobson
On Sun, Jan 17, 2021, at 18:16, Tom Lane wrote:
> I kind of wonder whether findoidjoins has outlived its purpose and
> we should just remove it (along with the oidjoins test script).

A bit of background:
I'm working on an extension where I need SQL access to this reference 
information.
My extension is successfully and automatically helping me to find problems in 
extension update-scripts,
where updating from one version to another would give a different result than 
directly installing the to-version from scratch.

Currently, I'm parsing findoidjoins/README and importing the "from column" and 
"to column"
to a lookup-table, which is used by my extension.

If findoidjoins is removed, I would be happy as long as this reference 
information
continues to be provided in some other simple machine-readable way,
like a CSV-file somewhere in the repo, or even better: making this information
available from SQL via a new lookup-table in pg_catalog.

I can see how parsing catalogs.sgml would be doable,
but proper SGML parsing is quite complex since it's a recursive language,
and can't be reliably parsed with e.g. simple regexes.

So I think adding this as a lookup table in pg_catalog is the best solution,
since extension writers could then use this information in various ways.

The information is theoretically already available via catalogs.sgml,
but a) it's not easy to parse, and b) it's not available from SQL.

Are there any other hackers who ever wished they would have had SQL
access to these catalog references?

If desired by enough others, perhaps something along these lines could work?

CREATE TABLE pg_catalog.pg_references (
colfrom text,
colto text,
UNIQUE (colfrom)
);

Where "colfrom" would be e.g. "pg_catalog.pg_class.relfilenode"
and "colto" would be "pg_catalog.pg_class.oid" for that example.

Not sure about the column names "colfrom"/"colto" though,
since the abbreviation for columns seems to be "att" in the pg_catalog context.

/Joel

Re: Deleting older versions in unique indexes to avoid page splits

2021-01-17 Thread Peter Geoghegan
On Sat, Jan 16, 2021 at 9:55 PM Amit Kapila  wrote:
> Do we do this optimization (bottom-up deletion) only when the last
> item which can lead to page split has indexUnchanged set to true? If
> so, what if the last tuple doesn't have indexUnchanged but the
> previous ones do have?

Using the indexUnchanged hint as a triggering condition makes sure
that non-HOT updaters are the ones that pay the cost of a bottom-up
deletion pass. We create backpressure for non-HOT updaters (in indexes
that are logically unchanged) specifically. (Of course it's also true
that the presence of the indexUnchanged hint is highly suggestive of
there being more version churn duplicates on the leaf page already,
which is not actually certain.)

It's possible that there will be some mixture of inserts and non-hot
updates on the same leaf page, and as you say the implementation might
fail to do a bottom-up pass based on the fact that an incoming item at
the point of a would-be page split was a plain insert (and so didn't
receive the hint). The possibility of these kinds of "missed
opportunities" are okay because a page split was inevitable either
way. You can imagine a case where that isn't true (i.e. a missed
opportunity to avoid a page split), but that's kind of like imagining
a fair coin flip taking place 100 times and coming up heads each time.
It just isn't realistic for such a "mixed tendencies" leaf page to
stay in flux (i.e. not ever split) forever, with the smartest
triggering condition in the world -- it's too unstable.

An important part of understanding the design is to imagine things at
the leaf page level, while thinking about how what that actually looks
like differs from an ideal physical representation of the same leaf
page that is much closer to the logical contents of the database.
We're only interested in leaf pages where the number of logical rows
is basically fixed over time (when there is version churn). Usually
this just means that there are lots of non-HOT updates, but it can
also work with lots of queue-like inserts and deletes in a unique
index.

Ultimately, the thing that determines whether or not the bottom-up
deletion optimization is effective for any given leaf page is the fact
that it actually prevents the page from splitting despite lots of
version churn -- this happens again and again. Another key concept
here is that bottom-up deletion passes for the same leaf page are
related in important ways -- it is often a mistake to think of
individual bottom-up passes as independent events.

> Why are we using terminology bottom-up deletion and why not simply
> duplicate version deletion or something on those lines?

Why is that simpler? Also, it doesn't exactly delete duplicates. See
my later remarks.

> How do we distinguish between version duplicate tuples (added due to
> the reason that some other index column is updated) or duplicate
> tuples (added because a user has inserted a row with duplicate value)
> for the purpose of bottom-up deletion?  I think we need to distinguish
> between them because in the earlier case we can remove the old version
> of the tuple in the index but not in later. Or is it the case that
> indexAm doesn't differentiate between them but relies on heapAM to
> give the correct answer?

Bottom-up deletion uses the presence of duplicates as a starting point
for determining which heap pages to visit, based on the assumption
that at least some are obsolete versions. But heapam.c has additional
heap-level heuristics that help too.

It is quite possible and even likely that we will delete some
non-duplicate tuples in passing, just because they're checked in
passing -- they might turn out to be deletable, for whatever reason.
We're also concerned (on the heapam.c side) about which heap pages
have the most TIDs (any kind of TID, not just one marked promising in
index AM), so this kind of "serendipity" is quite common in practice.
Often the total number of heap pages that are pointed to by all index
tuples on the page just isn't that many (8 - 10). And often cases with
lots of HOT pruning can have hundreds of LP_DEAD item pointers on a
heap page, which we'll tend to get to before too long anyway (with or
without many duplicates).

The nbtree/caller side makes inferences about what is likely to be
true about the "logical contents" of the physical leaf page, as a
starting point for the heuristic-driven search for deletable index
tuples. There are various ways in which each inference (considered
individually) might be wrong, including the one that you pointed out:
inserts of duplicates will look like update version churn. But that
particular issue won't matter if the inserted duplicates are on
mostly-distinct heap blocks (which is likely), because then they'll
only make a noise level contribution to the behavior in heapam.c.
Also, we can fall back on deduplication when bottom-up deletion fails,
which will merge together the duplicates-from-insert, so now any
future bottom-up deletion pass 

Re: cgit view availabel

2021-01-17 Thread Chapman Flack
On 01/17/21 08:48, Magnus Hagander wrote:
> I've finally gotten around to putting up a
> cgit instance on our git server, to allow for browsing of the git
> repositories. You can find this at:
> 
> https://git.postgresql.org/cgit/

Interesting!

Having never actively compared gitweb and cgit, what are the nicest
functional benefits I should be looking for?

Regards,
-Chap




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-17 Thread Zhihong Yu
Hi,

This patch introduces new function postgres_fdw_disconnect() when
called with a foreign server name discards the associated
connections with the server name.

I think the following would read better:

This patch introduces *a* new function postgres_fdw_disconnect(). When
called with a foreign server name, it discards the associated
connections with the server.

Please note the removal of the 'name' at the end - connection is with
server, not server name.

+   if (is_in_use)
+   ereport(WARNING,
+   (errmsg("cannot close the connection because it is
still in use")));

It would be better to include servername in the message.

+   ereport(WARNING,
+   (errmsg("cannot close all connections because some
of them are still in use")));

I think showing the number of active connections would be more informative.
This can be achieved by changing active_conn_exists from bool to int (named
active_conns, e.g.):

+   if (entry->conn && !active_conn_exists)
+   active_conn_exists = true;

Instead of setting the bool value, active_conns can be incremented.

Cheers

On Sat, Jan 16, 2021 at 11:39 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Sat, Jan 16, 2021 at 10:36 AM Bharath Rupireddy
>  wrote:
> > > > Please consider the v9 patch set for further review.
> > >
> > > Thanks for updating the patch! I reviewed only 0001 patch.
>
> I addressed the review comments and attached v10 patch set. Please
> consider it for further review.
>
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: cgit view availabel

2021-01-17 Thread Magnus Hagander
On Sun, Jan 17, 2021 at 5:46 PM Victor Yegorov  wrote:
>
> Sun, 17 Jan 2021 at 17:19, Magnus Hagander :
>>
>> > First thing I've noted:
>> >
>> >
>> > https://git.postgresql.org/cgit/postgresql.git/commit/960869da0803427d14335bba24393f414b476e2c
>> >
>> > silently shows another commit.
>>
>> Where did you get that URL from?
>
>
> I've made it up manually, comparing cgit and gitweb links.
>
>
>>
>> And AFAICT, a URL like that in cgit shows the latest commit in the
>> repo, for the path that you entered (which in this case is the hash
>> put in the wrong place).
>
>
> Yes, that's what I've noted too.
>
>
>> I guess we could capture a specific "looks like a hash" and redirect
>> that, assuming we would never ever have anything in a path or filename
>> in any of our repositories that looks like a hash. That seems like
>> maybe it's a bit of a broad assumption?
>
>
> I thought maybe it's possible to rewrite requests in a form:
>
> /cgit/*/commit/*
>
> into
>
> /cgit/*/commit/?id=&

That would break any repository that has a directory called "commit"
in it, wouldn't it?

That said we might be able to pick it up as a top level entry only,
because those subdirs would be expected to be under /tree/*/commit/*.

But we could also not do /cgit//commit/* -- for example
https://git.postgresql.org/cgit/postgresql.git/commit/src/backend/tcop/postgres.c?id=960869da0803427d14335bba24393f414b476e2c
is a perfectly valid url to show the part of the patch that affects
just this one part of the path.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Test harness for regex code (to allow importing Tcl's test suite)

2021-01-17 Thread Tom Lane
Alexander Lakhin  writes:
> select * from test_regex(repeat('(x)', 32), 'a', 'c');
> leads to
> ==00:00:00:05.736 2605072== Invalid write of size 4

Indeed, silly oversight on my part.  Fixed, thanks!

regards, tom lane




Re: evtfoid and evtowner missing in findoidjoins/README

2021-01-17 Thread Tom Lane
"Joel Jacobson"  writes:
> The below references are already properly documented in
>https://www.postgresql.org/docs/current/catalog-pg-event-trigger.html
> but missing in src/tools/findoidjoins/README.
> Join pg_catalog.pg_event_trigger.evtowner => pg_catalog.pg_authid.oid
> Join pg_catalog.pg_event_trigger.evtfoid => pg_catalog.pg_proc.oid

Yup, no surprise given the way findoidjoins works: it could only
discover those relationships if pg_event_trigger had some entries in
the end state of the regression database.  Of course it doesn't, and
I'd be against leaving a live event trigger in place in that DB.
(I suspect there are other similar gaps in the coverage.)

I kind of wonder whether findoidjoins has outlived its purpose and
we should just remove it (along with the oidjoins test script).
IMO it was intended to find mistakes in the initial catalog data,
but given the current policy that the .dat files shall not contain
numeric OID references, that type of mistake is impossible anymore.
Certainly, it's been so long since that test script has caught
anything that it doesn't seem worth the annual-or-so maintenance
effort to update it.

A different line of thought would be to try to teach findoidjoins
to scrape info about catalog references out of catalogs.sgml, and
use that instead of or in addition to its current methods.  That
seems like a fair amount of work though, and again I can't get
excited that it'd be worth the trouble.

Also, I recall mutterings on -hackers about adding foreign-key
entries to pg_constraint to document the catalog reference
relationships.  (In my possibly-faulty recollection, the idea
was that these'd only be documentation and would lack enforcement
triggers; but perhaps we'd allow the planner to trust them for
purposes of optimizing multi-catalog queries.)  If we had those,
we could make findoidjoins use them instead of trawling the data,
or maybe throw away findoidjoins per se and let the oidjoins.sql
script read the FK entries to find out what to check dynamically.

regards, tom lane




Re: Printing backtrace of postgres processes

2021-01-17 Thread vignesh C
On Sat, Jan 16, 2021 at 11:10 PM Andres Freund  wrote:
>
> Hi,
>
> On Sat, Jan 16, 2021, at 09:34, vignesh C wrote:
> > On Sat, Jan 16, 2021 at 1:40 AM Andres Freund  wrote:
> > >
> > > On 2021-01-15 09:53:05 +0100, Peter Eisentraut wrote:
> > > > On 2020-12-08 10:38, vignesh C wrote:
> > > > > I have implemented printing of backtrace based on handling it in
> > > > > CHECK_FOR_INTERRUPTS. This patch also includes the change to allow
> > > > > getting backtrace of any particular process based on the suggestions.
> > > > > Attached patch has the implementation for the same.
> > > > > Thoughts?
> > > >
> > > > Are we willing to use up a signal for this?
> > >
> > > Why is a full signal needed? Seems the procsignal infrastructure should
> > > suffice?
> >
> > Most of the processes have access to ProcSignal; for these processes,
> > printing of the call stack is handled using ProcSignal. The pgstat
> > process & syslogger process do not have access to ProcSignal, so
> > multiplexing with SIGUSR1 is not possible for them. So I handled
> > printing of the call stack for the pgstat process & syslogger using
> > the SIGUSR2 signal.
> > This is because shared memory is detached before the pgstat & syslogger
> > processes are started, by using the below:
> > /* Drop our connection to postmaster's shared memory, as well */
> > dsm_detach_all();
> > PGSharedMemoryDetach();
>
> Sure. But why is it important enough to support those that we are willing to 
> dedicate a signal to the task? Their backtraces aren't often interesting, so 
> I think we should just ignore them here.

Thanks for your comments, Andres. I will skip it for the processes
which do not have access to ProcSignal. I will make the changes and
post a patch for this soon.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: list of extended statistics on psql

2021-01-17 Thread Justin Pryzby
On Sun, Jan 17, 2021 at 03:31:57PM +0100, Tomas Vondra wrote:
> I've reverted the commit - once we find the right way to handle this, I'll
> get it committed again.

Please consider these doc changes for the next iteration.

commit 1a69f648ce6c63ebb37b6d8ec7c6539b3cb70787
Author: Justin Pryzby 
Date:   Sat Jan 16 17:47:35 2021 -0600

doc review: psql \dX 891a1d0bca262ca78564e0fea1eaa5ae544ea5ee

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index aaf55df921..a678a69dfb 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1928,15 +1928,15 @@ testdb=
 is specified, only those extended statistics whose names match the
 pattern are listed.
 If + is appended to the command name, each extended
-statistics is listed with its size.
+statistic is listed with its size.
 
 
 
-The column of the kind of extended stats (e.g. Ndistinct) shows some 
statuses.
+The column of the kind of extended stats (e.g. Ndistinct) shows its 
status.
 "requested" means that it needs to collect statistics by ANALYZE. 
 "built" means ANALYZE was 
-finished, and the planner can use it. NULL means that it doesn't 
exists. 
+run, and statistics are available to the planner. NULL means that it 
doesn't exist. 
 
 
   




Re: WIP: System Versioned Temporal Table

2021-01-17 Thread Surafel Temesgen
On Sat, Jan 16, 2021 at 10:12 PM Vik Fearing 
wrote:

>
> I haven't looked at this patch in a while, but I hope that ALTER TABLE t
> ADD SYSTEM VERSIONING is not adding any columns.  That is a bug if it does.
>
>
Yes, that is how I implemented it. I don't understand why that would be a bug?

regards
Surafel


Re: cgit view availabel

2021-01-17 Thread Victor Yegorov
Sun, 17 Jan 2021 at 17:19, Magnus Hagander :

> > First thing I've noted:
> >
> >
> https://git.postgresql.org/cgit/postgresql.git/commit/960869da0803427d14335bba24393f414b476e2c
> >
> > silently shows another commit.
>
> Where did you get that URL from?
>

I've made it up manually, comparing cgit and gitweb links.



> And AFAICT, a URL like that in cgit shows the latest commit in the
> repo, for the path that you entered (which in this case is the hash
> put in the wrong place).
>

Yes, that's what I've noted too.


I guess we could capture a specific "looks like a hash" and redirect
> that, assuming we would never ever have anything in a path or filename
> in any of our repositories that looks like a hash. That seems like
> maybe it's a bit of a broad assumption?
>

I thought maybe it's possible to rewrite requests of the form:

/cgit/*/commit/*

into

/cgit/*/commit/?id=<commit-hash>

?

-- 
Victor Yegorov


evtfoid and evtowner missing in findoidjoins/README

2021-01-17 Thread Joel Jacobson
The below references are already properly documented in

   https://www.postgresql.org/docs/current/catalog-pg-event-trigger.html

but missing in src/tools/findoidjoins/README.

Join pg_catalog.pg_event_trigger.evtowner => pg_catalog.pg_authid.oid
Join pg_catalog.pg_event_trigger.evtfoid => pg_catalog.pg_proc.oid

I'm not sure what the process for updating the README is;
the git log seems to indicate it is usually done as part of the release cycle,
so perhaps it's OK for this file to be out of sync between releases?

But if so, that wouldn't explain these two, since they have been around for ages.

/Joel

Re: cgit view availabel

2021-01-17 Thread Magnus Hagander
On Sun, Jan 17, 2021 at 5:00 PM Victor Yegorov  wrote:
>
> Sun, 17 Jan 2021 at 14:48, Magnus Hagander :
>>
>> After a short time (ahem, several years) of badgering of me by a
>> certain community member, I've finally gotten around to putting up a
>> cgit instance on our git server, to allow for browsing of the git
>> repositories. You can find this at:
>>
>> https://git.postgresql.org/cgit/
>>
>> or specifically for the postgresql git repo:
>>
>> https://git.postgresql.org/cgit/postgresql.git/
>
>
> Looks nice!
>
> First thing I've noted:
>
>
> https://git.postgresql.org/cgit/postgresql.git/commit/960869da0803427d14335bba24393f414b476e2c
>
> silently shows another commit.

Where did you get that URL from?

And AFAICT, a URL like that in cgit shows the latest commit in the
repo, for the path that you entered (which in this case is the hash
put in the wrong place).

> Is it possible to make the scheme above work?
> Our gitweb (and also github) is using it, so I assume people are quite used 
> to it.

I guess we could capture a specific "looks like a hash" and redirect
that, assuming we would never ever have anything in a path or filename
in any of our repositories that looks like a hash. That seems like
maybe it's a bit of a broad assumption?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: cgit view availabel

2021-01-17 Thread Victor Yegorov
Sun, 17 Jan 2021 at 14:48, Magnus Hagander :

> After a short time (ahem, several years) of badgering of me by a
> certain community member, I've finally gotten around to putting up a
> cgit instance on our git server, to allow for browsing of the git
> repositories. You can find this at:
>
> https://git.postgresql.org/cgit/
>
> or specifically for the postgresql git repo:
>
> https://git.postgresql.org/cgit/postgresql.git/
>

Looks nice!

First thing I've noted:


https://git.postgresql.org/cgit/postgresql.git/commit/960869da0803427d14335bba24393f414b476e2c

silently shows another commit.

Is it possible to make the scheme above work?
Our gitweb (and also github) is using it, so I assume people are quite used
to it.



-- 
Victor Yegorov


Re: POC: Cleaning up orphaned files using undo logs

2021-01-17 Thread Dmitry Dolgov
> On Fri, Dec 04, 2020 at 10:22:42AM +0100, Antonin Houska wrote:
> Amit Kapila  wrote:
>
> > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
> > >
> > > Amit Kapila  wrote:
> > >
> > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  wrote:
> > > >
> > > > If you want to track at undo record level, then won't it lead to
> > > > performance overhead and probably additional WAL overhead considering
> > > > this action needs to be WAL-logged. I think recording at page-level
> > > > might be a better idea.
> > >
> > > I'm not worried about WAL because the undo execution needs to be 
> > > WAL-logged
> > > anyway - see smgr_undo() in the 0005- part of the patch set. What needs 
> > > to be
> > > evaluated regarding performance is the (exclusive) locking of the page 
> > > that
> > > carries the progress information.
> > >
> >
> > That is just for one kind of smgr, think how you will do it for
> > something like zheap. Their idea is to collect all the undo records
> > (unless the undo for a transaction is very large) for one zheap-page
> > and apply them together, so maintaining the status at each undo record
> > level will surely lead to a large amount of additional WAL. See below
> > how and why we have decided to do it differently.
> >
> > > I'm still not sure whether this info should
> > > be on every page or only in the chunk header. In either case, we have a
> > > problem if there are two or more chunks created by different transactions 
> > > on
> > > the same page, and if more than one of these transactions needs to perform
> > > undo. I tend to believe that this should happen rarely though.
> > >
> >
> > I think we need to maintain this information at the transaction level
> > and need to update it after processing a few blocks, at least that is
> > what was decided and implemented earlier. We also need to update it
> > when the log is switched or all the actions of the transaction were
> > applied. The reasoning is that for short transactions it won't matter
> > and for larger transactions, it is good to update it after a few pages
> > to avoid WAL and locking overhead. Also, it is better if we collect
> > the undo in bulk, this is proved to be beneficial for large
> > transactions.
>
> Attached is what I originally did not include in the patch series, see the
> part 0012. I have no better idea so far. The progress information is stored in
> the chunk header.
>
> To avoid too frequent locking, maybe the UpdateLastAppliedRecord() function
> can be modified so it recognizes when it's necessary to update the progress
> info. Also the user (zheap) should think when it should call the function.
> Since I've included 0012 now as a prerequisite for discarding (0013),
> currently it's only necessary to update the progress at undo log chunk
> boundary.
>
> In this version of the patch series I wanted to publish the remaining ideas I
> haven't published yet.

Thanks for the updated patch. As I've mentioned off the list, I'm slowly
looking through it with the intent to concentrate on undo progress
tracking. But before I post anything I want to mention a couple of
strange issues I see, otherwise I will forget for sure. Maybe this is
already known, but when running 'make installcheck' several times
against a freshly built postgres with the patch applied, I observe
various errors from time to time.

This one happens during crash recovery; it seems like
UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in
the replay process:

TRAP: FailedAssertion("page_offset + this_page_bytes <= 
uph->ud_insertion_point", File: "undopage.c", Line: 300)
postgres: startup recovering 
00010012(ExceptionalCondition+0xa1)[0x558b38b8a350]
postgres: startup recovering 
00010012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e]
postgres: startup recovering 
00010012(UndoReplay+0xa1d)[0x558b38766f32]
postgres: startup recovering 
00010012(XactUndoReplay+0x77)[0x558b38769281]
postgres: startup recovering 
00010012(smgr_redo+0x1af)[0x558b387aa7bd]

This one is somewhat similar:

TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: 
"undopage.c", Line: 287)
postgres: undo worker for database 36893 
(ExceptionalCondition+0xa1)[0x5559c90f1350]
postgres: undo worker for database 36893 
(UndoPageOverwrite+0xa6)[0x5559c8cc8ae3]
postgres: undo worker for database 36893 
(UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008]
postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989]

Here and there, there are also messages about undo files not being found:

ERROR:  cannot open undo segment file 'base/undo/08.02': No such file or directory
WARNING:  failed to undo transaction

I haven't found the trigger yet, but my impression is that it
happens after the create_table tests.




Re: Online checksums patch - once again

2021-01-17 Thread Magnus Hagander
On Fri, Jan 15, 2021 at 11:33 AM Daniel Gustafsson  wrote:
>
> The attached v30 adds the proposed optimizations in this thread as previously
> asked for, as well as some small cleanups to the procsignal calling codepath
> (replacing single call functions with just calling the function) and some
> function comments which were missing.

I've applied the docs patch.

I made a tiny change so the reference to "data page checksums" was
changed to "data checksums". Of course, after doing that I realize
that we use both terms in different places, but the docs side mostly
talked about "data checksums". I changed the one reference that was
also in wal.sgml, but left the rest alone.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: list of extended statistics on psql

2021-01-17 Thread Tomas Vondra




On 1/17/21 3:01 AM, Tomas Vondra wrote:

On 1/17/21 2:41 AM, Shinoda, Noriyoshi (PN Japan FSIP) wrote:

Hi, hackers.

I tested this committed feature.
It doesn't seem to be available to non-superusers due to the inability
to access pg_statistic_ext_data.

Is this the expected behavior?



Hmmm, that's a good point. Bummer we didn't notice that earlier.

I wonder what the right fix should be - presumably we could do something 
like pg_stats_ext (we can't use that view directly, because it formats 
the data, so the sizes are different).


But should it list just the stats the user has access to, or should it 
list everything and leave the inaccessible fields NULL?




I've reverted the commit - once we find the right way to handle this, 
I'll get it committed again.


As for how to deal with this, I can think of about three ways:

1) simplify the command to only print information from pg_statistic_ext
(so no information about which stats are built, or their sizes)


2) extend pg_stats_ext with necessary information (e.g. sizes)

3) create a new system view, with necessary information (so that 
pg_stats_ext does not need to be modified)


4) add functions returning the necessary information, possibly only for 
statistics the user can access (similarly to what pg_stats_ext does)


Options 2-4 have the obvious disadvantage that this won't work on older 
releases (we can't add views or functions there). So I'm leaning towards 
#1 even if that means we have to remove some of the details. We can 
consider adding that for new releases, though.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: document the hook system

2021-01-17 Thread Magnus Hagander
On Fri, Jan 15, 2021 at 8:28 AM Peter Eisentraut
 wrote:
>
> On 2020-12-31 04:28, David Fetter wrote:
> > This could probably use a lot of filling in, but having it in the
> > actual documentation beats needing to know folklore even to know
> > that the capability is there.
>
> This patch seems quite short of a state where one could begin to
> evaluate it.  Documenting the hooks better seems a worthwhile goal.   I
> think the question is whether we can develop documentation that is
> genuinely useful by itself without studying the relevant source code.
> This submission does not address that question.

Even just having a list of available hooks would be a nice improvement though :)

But maybe it's something that belongs better in a README file instead,
since as you say it's unlikely to be properly useful without looking
at the source anyway. But just a list of hooks and a *very* high-level
overview of where each of them hooks in would definitely be useful to
have somewhere, I think. Having to find with "grep" whether or not a
hook exists for approximately what you're looking for is definitely a
process to improve on.
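
To illustrate what such a list would describe, the usual pattern for
installing one of these hooks from an extension looks roughly like this
(a minimal sketch using planner_hook with the signature it has in recent
releases; not proposed documentation text):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/params.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

static planner_hook_type prev_planner_hook = NULL;

static PlannedStmt *
my_planner(Query *parse, const char *query_string,
		   int cursorOptions, ParamListInfo boundParams)
{
	/* inspect or adjust planning here, then chain to the previous hook */
	if (prev_planner_hook)
		return prev_planner_hook(parse, query_string, cursorOptions, boundParams);
	return standard_planner(parse, query_string, cursorOptions, boundParams);
}

void
_PG_init(void)
{
	/* save whatever was installed before us, then install ourselves */
	prev_planner_hook = planner_hook;
	planner_hook = my_planner;
}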

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




cgit view availabel

2021-01-17 Thread Magnus Hagander
After a short time (ahem, several years) of badgering of me by a
certain community member, I've finally gotten around to putting up a
cgit instance on our git server, to allow for browsing of the git
repositories. You can find this at:

https://git.postgresql.org/cgit/

or specifically for the postgresql git repo:

https://git.postgresql.org/cgit/postgresql.git/


For the time being we're running both this and gitweb, and all the
redirects will keep pointing to gitweb, as will the default
redirect if you just go to https://git.postgresql.org/.

If people prefer it, we can discuss changing that in the future, but
let's start with some proper full-scale testing to see that it doesn't
actually just break for some people :)

//Magnus




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2021-01-17 Thread Magnus Hagander
On Thu, Jan 7, 2021 at 11:53 AM Peter Eisentraut
 wrote:
>
> After pondering this again, I think we can go with initdb
> --no-instructions, as in your patch.
>
> As a minor nitpick, I would leave out the
>
>  else
>  printf(_("\nSuccess.\n"));
>
> in the --no-instructions case.

OK, thanks. I have applied it as such, with that message moved inside
the if statement.
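
For readers without the commit at hand, the resulting shape in initdb.c is
roughly this (a sketch of what is described above, not the exact committed
diff):

	/* only print the closing message and next-step hints when wanted */
	if (!noinstructions)
	{
		printf(_("\nSuccess.\n"));
		/* ... followed by the usual "you can now start the server using ..." hint */
	}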


> (I don't know where the pg_upgrade part of this discussion is right now.)

Yeah, I think that one turned into quite a different discussion. I
think it's fair to say that the part of this patch related to
pg_upgrade was rejected - I'll drop that one. The pg_upgrade
scripts have to be thought about as a completely separate thing, so
let's leave that for another thread.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Add session statistics to pg_stat_database

2021-01-17 Thread Magnus Hagander
On Fri, Jan 8, 2021 at 10:34 AM Laurenz Albe  wrote:
>
> On Fri, 2021-01-08 at 12:00 +0900, Masahiro Ikeda wrote:
> > 2. monitoring.sgml
> >
> > > > IIUC, "active_time" includes the time executes a fast-path function
> > > > and
> > > > "idle in transaction" includes "idle in transaction(aborted)" time.
> > > > Why don't you reference pg_stat_activity's "state" column and
> > > > "active_time" is the total time when the state is "active" and "fast
> > > > path"?
> > > > "idle in transaction" is as same too.
> > >
> > > Good idea; I have expanded the documentation like that.
> >
> > BTW, is there any reason to merge the above statistics?
> > IIUC, the only downside of separating the statistics is that it adds two
> > columns, and there is no performance penalty. So, I wonder whether there
> > is a way to separate them corresponding to the state column of
> > pg_stat_activity.
>
> Sure, that could be done.
>
> I decided to do it like this because I thought that few people would
> be interested in "time spent doing fast-path function calls"; my guess
> was that the more interesting value is "time where the database was
> busy calculating results".
>
> I tried to keep the balance between providing reasonable detail
> while not creating more additional columns to "pg_stat_database"
> than necessary.
>
> This is of course a matter of taste, and it is good to hear different
> opinions.  If more people share your opinion, I'll change the code.
>
> > There are some following codes in pgstatfuncs.c.
> > int64 result = 0.0;
> >
> > But, I think the following is better.
> > int64 result = 0;
>
> You are right.  That was a silly copy-and-paste error.  Fixed.
>
> > Although pg_stat_get_db_session_time now initializes "result" to zero
> > when it is declared, the other pg_stat_XXX functions don't. Is it
> > better to change them?
>
> I looked at other similar functions, and the ones I saw returned
> NULL if there were no data.  In that case, it makes sense to write
>
> char *result;
>
> if ((result = get_stats_data()) == NULL)
>     PG_RETURN_NULL();
>
> PG_RETURN_TEXT_P(cstring_to_text(result));
>
> But I want to return 0 for the session time if there are no data yet,
> so I think initializing the result to 0 in the declaration makes sense.
>
> There are some functions that do it like this:
>
> int32   result;
>
> result = 0;
> for (...)
> {
>     if (...)
>         result++;
> }
>
> PG_RETURN_INT32(result);
>
> Again, it is a matter of taste, and I didn't detect a clear pattern
> in the existing code that I feel I should follow in this question.
>
> Version 12 of the patch is attached.

Thanks! I have applied this version, with some minor changes:

* I renamed the n_*_time members in the struct to just
total_*_time. The n_ prefix indicates "number of" and is thus wrong for
time parameters. (A rough sketch of the resulting accessor is below.)

* Some very minor wording changes.

* catversion bump (for once I didn't forget it!)
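
To make the rename concrete, the accessor under discussion ends up looking
roughly like the following. This is a from-memory sketch of the pattern
(including initializing the result to 0 rather than returning NULL when no
stats exist yet), not a verbatim copy of the committed code:

#include "postgres.h"
#include "fmgr.h"
#include "pgstat.h"

Datum
pg_stat_get_db_session_time(PG_FUNCTION_ARGS)
{
	Oid			dbid = PG_GETARG_OID(0);
	double		result = 0.0;
	PgStat_StatDBEntry *dbentry;

	/* convert the counter from microseconds to milliseconds for display */
	if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
		result = ((double) dbentry->total_session_time) / 1000.0;

	PG_RETURN_FLOAT8(result);
}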


--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Pg14, pg_dumpall and "password_encryption=true"

2021-01-17 Thread Magnus Hagander
On Sat, Jan 16, 2021 at 8:27 AM Noah Misch  wrote:
>
> On Fri, Jan 15, 2021 at 01:35:50PM +0900, Ian Lawrence Barwick wrote:
> > $ tail -3 pg_upgrade_utility.log
> > ALTER ROLE "postgres" SET "password_encryption" TO 'true';
> > psql:pg_upgrade_dump_globals.sql:75: ERROR:  invalid value for
> > parameter "password_encryption": "true"
> > HINT:  Available values: md5, scram-sha-256.
> >
> > This is a consquence of commit c7eab0e97, which removed support for the
> > legacy
> > values "on"/"true"/"yes"/"1".
>
> Right.  This can happen anytime we reduce the domain of a setting having
> GucContext PGC_SU_BACKEND, PGC_BACKEND, PGC_SUSET, or PGC_USERSET.
>
> > I see following options to resolve this issue.
> >
> > 1. Re-add support for a "true" as an alias for "md5".
> > 2. Transparently rewrite "true" to "md5"
> > 3. Have pg_dumpall map "true" to "md5"
> > 4. Add an option to pg_dumpall to specify the target version
>
> I expect rather few databases override this particular setting in
> pg_proc.proconfig or pg_db_role_setting.setconfig.  The restore failure has a
> clear error message, which is enough.  Each of the above would be overkill.

Option 3 would be the closest to how other things work though,
wouldn't it? Normally, it's the job of pg_dump (or pg_dumpall) to
adapt the dump to the new version of PostgreSQL.

And pg_dump[all] always generates a dump that should reload in the
same version of postgres as the dump program -- not the source
PostgreSQL version. Thus doing (4) would mean changing that, and would have to
teach pg_dump[all] about *all* version differences in all directions
-- which would be a huge undertaking. Doing that for just one
individual parameter would be very strange.

In a lot of ways, I think (3) seems like the reasonable thing to do.
That's basically what we do when things change in the table creation
commands etc, so it seems like the logical place.
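
To make (3) concrete, the mapping itself would be tiny. The sketch below is
purely illustrative: the function name and placement are made up, and nothing
like it exists in pg_dump today.

#include "postgres_fe.h"

static const char *
map_password_encryption(const char *value)
{
	/* translate the legacy boolean spellings removed by c7eab0e97 */
	if (pg_strcasecmp(value, "on") == 0 ||
		pg_strcasecmp(value, "true") == 0 ||
		pg_strcasecmp(value, "yes") == 0 ||
		strcmp(value, "1") == 0)
		return "md5";

	return value;
}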


> > 5. Have pg_upgrade emit a warning/hint
>
> If done in a way not specific to this one setting, +1 for this.  That is to
> say, do something like the following.  Retrieve every pg_proc.proconfig and
> pg_db_role_setting.setconfig value from the old cluster.  Invoke the new
> postmaster binary to test acceptance of each value.  I'm not generally a fan
> of adding pg_upgrade code to predict whether the dump will fail to restore,
> because such code will never be as good as just trying the restore.  That
> said, this checking of GUC acceptance could be self-contained and useful for
> the long term.

Yeah, while I think (3) above would be the better choice in this
particular case, I agree something like that would be useful in the
longer term. It might not always be possible to declare a rule for
mapping values, after all.


> > 6. Document this as a backwards-compatibility thing
> > 
> >
> > Add an item in the documentation (release notes, pg_upgrade, pg_dumpall)
> > stating
> > that any occurrences of "password_encryption" which are not valid in Pg14
> > should
> > be updated before performing an upgrade.
>
> The release notes will document the password_encryption change, though they
> probably won't specifically mention the interaction with pg_dump.  I would not
> document it elsewhere.

+1.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread Amit Langote
On Sat, Jan 16, 2021 at 2:02 PM Amit Kapila  wrote:
> On Fri, Jan 15, 2021 at 6:45 PM Amit Langote  wrote:
> > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila  wrote:
> > > We want to do this for Inserts where only Select can be parallel and
> > > Inserts will always be done by the leader backend. This is actually
> > > the case we first want to implement.
> >
> > Sorry, I haven't looked at the linked threads and the latest patches
> > there closely enough yet, so I may be misreading this, but if the
> > inserts will always be done by the leader backend as you say, then why
> > does the planner need to be checking the parallel safety of the
> > *target* table's expressions?
> >
>
> The reason is that once we enter parallel-mode we can't allow
> parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
> enter the parallel-mode at the beginning of the statement execution,
> see ExecutePlan(). So, the Insert will be performed in parallel-mode
> even though it happens in the leader backend. It is not possible that
> we finish getting all the tuples from the gather node first and then
> start inserting. Even if we somehow find a way to make this work, the
> checks being discussed will be required to make inserts parallel
> (where inserts will be performed by workers), which is actually the
> next patch in the thread I mentioned in the previous email.
>
> Does this answer your question?
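
(For reference, a much-condensed sketch of the flow in ExecutePlan() in
execMain.c that is being described; parallel mode wraps the whole statement,
so the insert performed by the leader also runs under the parallel-mode
restrictions:)

	estate->es_use_parallel_mode = use_parallel_mode;
	if (use_parallel_mode)
		EnterParallelMode();

	for (;;)
	{
		ResetPerTupleExprContext(estate);

		/*
		 * For INSERT ... SELECT this drives the ModifyTable node, whose
		 * child plan may contain a Gather pulling tuples from workers.
		 */
		slot = ExecProcNode(planstate);
		if (TupIsNull(slot))
			break;

		/* ... route the tuple to the destination receiver ... */
	}

	if (use_parallel_mode)
		ExitParallelMode();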

Yes, thanks for the explanation.  I kind of figured that doing the
insert part itself in parallel using workers would be a part of the
end goal of this work, although that didn't come across immediately.

It's a bit unfortunate that the parallel safety check of the
individual partitions cannot be deferred until it's known that a given
partition will be affected by the command at all.  Will we need
fundamental changes to how parallel query works to make that possible?
 If so, have such options been considered in these projects?  If such
changes are not possible in the short term, like for v14, we should at
least try to make sure that the eager checking of all partitions is
only performed if using parallelism is possible at all.

I will try to take a look at the patches themselves to see if there's
something I know that will help.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Dump public schema ownership & seclabels

2021-01-17 Thread Vik Fearing
On 1/17/21 10:41 AM, Noah Misch wrote:
> On Sat, Jan 16, 2021 at 02:05:43PM +0100, Vik Fearing wrote:
>> On 12/30/20 12:59 PM, Noah Misch wrote:
>>> On Tue, Dec 29, 2020 at 05:49:24AM -0800, Noah Misch wrote:
 https://postgr.es/m/20201031163518.gb4039...@rfd.leadboat.com gave 
 $SUBJECT as
 one of the constituent projects for changing the public schema default ACL.
 This ended up being simple.  Attached.
>>>
>>> This is defective; it fails to reproduce nspacl after "ALTER SCHEMA public
>>> OWNER TO pg_write_server_files; REVOKE ALL ON SCHEMA public FROM
>>> pg_write_server_files;".  I will try again later.
>>
>> Could I ask you to also include COMMENTs when you try again, please?
> 
> That may work.  I had not expected to hear of a person changing the comment on
> schema public.  To what do you change it?

It was a while ago and I don't remember because it didn't appear in the
dump so I stopped doing it. :(

Mine was an actual comment, but there are some tools out there that
(ab)use COMMENTs as crude metadata for what they do.  For example:
https://postgresql-anonymizer.readthedocs.io/en/stable/declare_masking_rules/#declaring-rules-with-comments
-- 
Vik Fearing




Re: Dump public schema ownership & seclabels

2021-01-17 Thread Noah Misch
On Sat, Jan 16, 2021 at 02:05:43PM +0100, Vik Fearing wrote:
> On 12/30/20 12:59 PM, Noah Misch wrote:
> > On Tue, Dec 29, 2020 at 05:49:24AM -0800, Noah Misch wrote:
> >> https://postgr.es/m/20201031163518.gb4039...@rfd.leadboat.com gave 
> >> $SUBJECT as
> >> one of the constituent projects for changing the public schema default ACL.
> >> This ended up being simple.  Attached.
> > 
> > This is defective; it fails to reproduce nspacl after "ALTER SCHEMA public
> > OWNER TO pg_write_server_files; REVOKE ALL ON SCHEMA public FROM
> > pg_write_server_files;".  I will try again later.
> 
> Could I ask you to also include COMMENTs when you try again, please?

That may work.  I had not expected to hear of a person changing the comment on
schema public.  To what do you change it?




Re: Is Recovery actually paused?

2021-01-17 Thread Dilip Kumar
On Sat, Jan 16, 2021 at 8:59 AM Masahiko Sawada  wrote:
>
> On Wed, Jan 13, 2021 at 9:20 PM Dilip Kumar  wrote:
> >
> > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar  wrote:
> > >
> > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA  wrote:
> > > >
> > > > On Thu, 10 Dec 2020 11:25:23 +0530
> > > > Dilip Kumar  wrote:
> > > >
> > > > > > > However, I wonder users don't expect pg_is_wal_replay_paused to 
> > > > > > > wait.
> > > > > > > Especially, if max_standby_streaming_delay is -1, this will be 
> > > > > > > blocked forever,
> > > > > > > although this setting may not be usual. In addition, some users 
> > > > > > > may set
> > > > > > > recovery_min_apply_delay for a large.  If such users call 
> > > > > > > pg_is_wal_replay_paused,
> > > > > > > it could wait for a long time.
> > > > > > >
> > > > > > > At least, I think we need some descriptions on document to explain
> > > > > > > pg_is_wal_replay_paused could wait while a time.
> > > > > >
> > > > > > Ok
> > > > >
> > > > > Fixed this, added some comments in .sgml as well as in function header
> > > >
> > > > Thank you for fixing this.
> > > >
> > > > Also, is it better to fix the description of pg_wal_replay_pause from
> > > > "Pauses recovery." to "Request to pause recovery." in according with
> > > > pg_is_wal_replay_paused?
> > >
> > > Okay
> > >
> > > >
> > > > > > > Also, how about adding a new boolean argument to 
> > > > > > > pg_is_wal_replay_paused to
> > > > > > > control whether this waits for recovery to get paused or not? By 
> > > > > > > setting its
> > > > > > > default value to true or false, users can use the old format for 
> > > > > > > calling this
> > > > > > > and the backward compatibility can be maintained.
> > > > > >
> > > > > > So basically, if the wait_recovery_pause flag is false then we will
> > > > > > immediately return true if the pause is requested?  I agree that it 
> > > > > > is
> > > > > > good to have an API to know whether the recovery pause is requested 
> > > > > > or
> > > > > > not but I am not sure is it good idea to make this API serve both 
> > > > > > the
> > > > > > purpose?  Anyone else have any thoughts on this?
> > > > > >
> > > >
> > > > I think the current pg_is_wal_replay_paused() already has another 
> > > > purpose;
> > > > this waits recovery to actually get paused. If we want to limit this 
> > > > API's
> > > > purpose only to return the pause state, it seems better to fix this to 
> > > > return
> > > > the actual state at the cost of lacking the backward compatibility. If 
> > > > we want
> > > > to know whether pause is requested, we may add a new API like
> > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait recovery 
> > > > to actually
> > > > get paused, we may add an option to pg_wal_replay_pause() for this 
> > > > purpose.
> > > >
> > > > However, this might be a bikeshedding. If anyone don't care that
> > > > pg_is_wal_replay_paused() can make user wait for a long time, I don't 
> > > > care either.
> > >
> > > I don't think that it will be blocked ever, because
> > > pg_wal_replay_pause is sending the WakeupRecovery() which means the
> > > recovery process will not be stuck on waiting for the WAL.
> > >
> > > > > > > As another comment, while pg_is_wal_replay_paused is blocking, I 
> > > > > > > can not cancel
> > > > > > > the query. I think CHECK_FOR_INTERRUPTS() is necessary in the 
> > > > > > > waiting loop.
> > > >
> > > > How about this fix? I think users may want to cancel 
> > > > pg_is_wal_replay_paused() during
> > > > this is blocking.
> > >
> > > Yeah, we can do this.  I will send the updated patch after putting
> > > some more thought into these comments.  Thanks again for the feedback.
> > >
> >
> > Please find the updated patch.
>
> I've looked at the patch. Here are review comments:
>
> +   /* Recovery pause state */
> +   RecoveryPauseState  recoveryPause;
>
> Now that the value can have tri-state, how about renaming it to
> recoveryPauseState?

This makes sense to me.

> ---
>  bool
>  RecoveryIsPaused(void)
> +{
> +   bool        recoveryPause;
> +
> +   SpinLockAcquire(&XLogCtl->info_lck);
> +   recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED) ? true : false;
> +   SpinLockRelease(&XLogCtl->info_lck);
> +
> +   return recoveryPause;
> +}
> +
> +bool
> +RecoveryPauseRequested(void)
>  {
> bool        recoveryPause;
>
> SpinLockAcquire(&XLogCtl->info_lck);
> -   recoveryPause = XLogCtl->recoveryPause;
> +   recoveryPause = (XLogCtl->recoveryPause != RECOVERY_IN_PROGRESS) ? true : false;
> SpinLockRelease(&XLogCtl->info_lck);
>
> return recoveryPause;
>  }
>
> We can write like recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED);

In RecoveryPauseRequested, we just want to know whether a pause has been
requested; even if the pause has been requested but has not yet taken
effect, we want to return true. So how would
recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED) work there?
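
For clarity, the simplification presumably applies to each function with its
own comparison, along the lines of the following sketch (not the final patch):

bool
RecoveryIsPaused(void)
{
	bool		recoveryPause;

	SpinLockAcquire(&XLogCtl->info_lck);
	recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED);
	SpinLockRelease(&XLogCtl->info_lck);

	return recoveryPause;
}

bool
RecoveryPauseRequested(void)
{
	bool		recoveryPause;

	SpinLockAcquire(&XLogCtl->info_lck);
	recoveryPause = (XLogCtl->recoveryPause != RECOVERY_IN_PROGRESS);
	SpinLockRelease(&XLogCtl->info_lck);

	return recoveryPause;
}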

>